8  Advanced Performance Metrics

8.1 Overview of Unit

8.1.1 Learning Objectives

  • Understand costs and benefits of accuracy
  • Use of a confusion matrix
  • Understand costs and benefits of other performance metrics
  • The ROC curve and area under the curve
  • Model selection using other performance metrics
  • How to address class imbalance
    • Selection of performance metric
    • Selection of classification threshold
    • Sampling and resampling approaches

8.1.2 Readings

Post questions to the readings channel in Slack

8.1.3 Lecture Videos

Post questions to the video-lectures channel in Slack


8.1.4 Coding Assignment

Post questions to application-assignments Slack channel

Submit the application assignment here and complete the unit quiz by 8 pm on Wednesday, March 13th


8.2 Introduction

In this unit, we will again use the Cleveland heart disease dataset.

However, I have modified it to make the outcome unbalanced such that yes represents approximately 10% of the observations.

Now that we will calculate performance metrics beyond accuracy, the order of the levels of our outcome variable (disease) matters. We will make sure that the positive class (the event of interest; in this case, yes for disease) is the first level.


First, let's open and skim the raw data

data_all <- read_csv(here::here(path_data, "cleveland_unbalanced.csv"), 
                     col_names = FALSE, na = "?", col_types = cols()) |> 
  rename(age = X1,
         sex = X2,
         cp = X3,
         rest_bp = X4,
         chol = X5,
         fbs = X6,
         rest_ecg = X7,
         exer_max_hr = X8,
         exer_ang = X9,
         exer_st_depress = X10,
         exer_st_slope = X11,
         ca = X12,
         thal = X13,
         disease = X14) 

data_all |> skim_some()
Data summary
Name data_all
Number of rows 1281
Number of columns 14
_______________________
Column type frequency:
numeric 14
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate p0 p100
age 0 1.00 29 77.0
sex 0 1.00 0 1.0
cp 0 1.00 1 4.0
rest_bp 0 1.00 94 200.0
chol 0 1.00 126 564.0
fbs 0 1.00 0 1.0
rest_ecg 0 1.00 0 2.0
exer_max_hr 0 1.00 71 202.0
exer_ang 0 1.00 0 1.0
exer_st_depress 0 1.00 0 6.2
exer_st_slope 0 1.00 1 3.0
ca 22 0.98 0 3.0
thal 8 0.99 3 7.0
disease 0 1.00 0 4.0

Code categorical variables as factors with meaningful text labels (and no spaces)

  • NOTE the use of disease = fct_relevel(disease, "yes") to set yes as the positive class (first level) for disease
data_all <- data_all |> 
  mutate(disease = factor(disease, levels = 0:4, 
                          labels = c("no", "yes", "yes", "yes", "yes")),
         disease = fct_relevel(disease, "yes"),
         sex = factor(sex,  levels = c(0, 1), labels = c("female", "male")),
         fbs = factor(fbs, levels = c(0, 1), labels = c("no", "yes")),
         exer_ang = factor(exer_ang, levels = c(0, 1), labels = c("no", "yes")),
         exer_st_slope = factor(exer_st_slope, levels = 1:3, 
                                labels = c("upslope", "flat", "downslope")),
         cp = factor(cp, levels = 1:4, 
                     labels = c("typ_ang", "atyp_ang", "non_anginal", "non_anginal")),
         rest_ecg = factor(rest_ecg, levels = 0:2, 
                           labels = c("normal", "wave_abn", "ventric_hypertrophy")),
         thal = factor(thal, levels = c(3, 6, 7), 
                       labels = c("normal", "fixeddefect", "reversabledefect")))

data_all |> skim_some()
Data summary
Name data_all
Number of rows 1281
Number of columns 14
_______________________
Column type frequency:
factor 8
numeric 6
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
sex 0 1.00 FALSE 2 mal: 752, fem: 529
cp 0 1.00 FALSE 3 non: 872, aty: 296, typ: 113
fbs 0 1.00 FALSE 2 no: 1104, yes: 177
rest_ecg 0 1.00 FALSE 3 nor: 721, ven: 550, wav: 10
exer_ang 0 1.00 FALSE 2 no: 1044, yes: 237
exer_st_slope 0 1.00 FALSE 3 ups: 778, fla: 434, dow: 69
thal 8 0.99 FALSE 3 nor: 940, rev: 285, fix: 48
disease 0 1.00 FALSE 2 no: 1142, yes: 139

Variable type: numeric

skim_variable n_missing complete_rate p0 p100
age 0 1.00 29 77.0
rest_bp 0 1.00 94 200.0
chol 0 1.00 126 564.0
exer_max_hr 0 1.00 71 202.0
exer_st_depress 0 1.00 0 6.2
ca 22 0.98 0 3.0

Disease is now unbalanced

data_all |> tab(disease)
# A tibble: 2 × 3
  disease     n  prop
  <fct>   <int> <dbl>
1 yes       139 0.109
2 no       1142 0.891

For this example, we will evaluate our final model using a held-out test set

set.seed(20140102)
splits_test <- data_all |> 
  initial_split(prop = 2/3, strata = "disease")

data_trn <- splits_test |> 
  analysis()

data_test <- splits_test |> 
  assessment()

We will be fitting a penalized logistic regression again (using glmnet)

We will do only the basic feature engineering needed for this algorithm and to handle missing data

rec <- recipe(disease ~ ., data = data_trn) |> 
  step_impute_median(all_numeric_predictors()) |> 
  step_impute_mode(all_nominal_predictors()) |>   
  step_dummy(all_nominal_predictors()) |> 
  step_normalize(all_predictors())

We tune/select best hyperparameter values via bootstrap resampling with the training data

  • get bootstrap splits
  • make grid of hyperparameter values
splits_boot <- data_trn |> 
  bootstraps(times = 100, strata = "disease")  

grid_glmnet <- expand_grid(penalty = exp(seq(-8, 3, length.out = 200)),
                           mixture = seq(0, 1, length.out = 6))

fits_glmnet <- cache_rds(
  expr = {
    logistic_reg(penalty = tune(), 
                 mixture = tune()) |> 
      set_engine("glmnet") |> 
      set_mode("classification") |> 
      tune_grid(preprocessor = rec, 
                resamples = splits_boot, grid = grid_glmnet, 
                metrics = metric_set(accuracy))
  },
  dir = "cache/008/",
  file = "fits_glmnet",
  rerun = rerun_setting)

Review hyperparameter plot and best values for hyperparameters

plot_hyperparameters(fits_glmnet, hp1 = "penalty", hp2 = "mixture", 
                     metric = "accuracy", log_hp1 = TRUE)

show_best(fits_glmnet, n = 1)
# A tibble: 1 × 8
   penalty mixture .metric  .estimator  mean     n std_err
     <dbl>   <dbl> <chr>    <chr>      <dbl> <int>   <dbl>
1 0.000583       1 accuracy binary     0.929   100 0.00108
  .config                
  <chr>                  
1 Preprocessor1_Model1011

Let’s fit this best model configuration to all of our training data and evaluate it in test

rec_prep <- rec |> 
  prep(data_trn)
feat_trn <- rec_prep |> 
  bake(data_trn)

fit_glmnet <-   
  logistic_reg(penalty = select_best(fits_glmnet)$penalty, 
               mixture = select_best(fits_glmnet)$mixture) |> 
  set_engine("glmnet") |> 
  set_mode("classification") |> 
  fit(disease ~ ., data = feat_trn)

And evaluate it by predicting into test

feat_test <- rec_prep |> 
  bake(data_test)

(model_accuracy <- accuracy_vec(feat_test$disease, predict(fit_glmnet, feat_test)$.pred_class))
[1] 0.9299065

Accuracy is an attractive measure:

  • It is intuitive and widely understood
  • It extends naturally to multi-class scenarios
  • These are not trivial advantages in research or application

However, accuracy has at least three problems in some situations

  • If the outcome is unbalanced, it can be misleading

    • High performance from simply predicting the majority (vs. minority) class for all observations
    • Need to anchor evaluation of accuracy to baseline performance based on the majority case percentage
  • If the outcome is unbalanced, selecting among model configurations with accuracy can be biased toward configurations that predict the majority class, because that yields high accuracy by itself, even without any signal from the predictors

  • Regardless of outcome distribution, it considers false positives (FP) and false negatives (FN) equivalent in their costs

    • This is often not the case

Outcome distributions:

  • May start to be considered unbalanced at ratios of 1:5 (20% of cases in the infrequent class)
  • In many real life applications (e.g., fraud detection), imbalance ratios ranging from 1:1000 up to 1:5000 are not atypical

When working with unbalanced datasets:

  • The class or classes with many observations are called the major or majority class(es)
  • The class with few observations (and there is typically just one) is called the minor or minority class.

In our example, the majority class is negative (no) for heart disease and the minority class is positive (yes) for heart disease


Question: In our example, our model’s accuracy in test seemed high, but was it really performing as well as it seems?
Show Answer
Although this test accuracy seems high, this is somewhat misleading.  A model that 
simply labeled everyone as negative for heart disease would achieve almost as high 
accuracy in our test data

(model_accuracy <- accuracy_vec(feat_test$disease, 
                                predict(fit_glmnet, feat_test)$.pred_class))
[1] 0.9299065
feat_test |> tab(disease)
# A tibble: 2 × 3
  disease     n  prop
  <fct>   <int> <dbl>
1 yes        47 0.110
2 no        381 0.890
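
We can confirm this directly. A minimal sketch (using objects defined above): a null model that predicts no for every test observation achieves nearly the same accuracy as our fitted model.

# baseline accuracy from always predicting the majority class ("no")
baseline <- factor(rep("no", nrow(feat_test)), levels = c("yes", "no"))
accuracy_vec(feat_test$disease, baseline)
[1] 0.8901869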

Question: Perhaps more importantly, are the costs of false positives (screening someone as positive when they do not have heart disease) and false negatives (screening someone as negative when they do have heart disease) comparable for a preliminary screening method?
Show Answer
Probably not.  A false positive might mean that we do more testing that is unnecessary 
and later find out they do not have heart disease.   This comes with some monetary 
cost and also likely some distress for the patient.   However, a false negative 
means we send the patient home thinking they are healthy and may suffer a heart 
attack or other bad outcome.  That seems worse if this is only a preliminary screen.

8.4 The Receiver Operating Characteristic Curve

Let’s return now to consider sensitivity and specificity again

Remember that our classifier is estimating the probability of an observation being in the positive class.

We dichotomize this probability when we formally make a class prediction

  • If the probability > 50%, we classify the observation as positive
  • If the probability <= 50%, we classify the observation as negative
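
As a quick check (a sketch using objects defined above), we can confirm that the default class predictions are just the predicted probabilities dichotomized at 50%; the two should agree except possibly for ties at exactly 50%.

probs <- predict(fit_glmnet, feat_test, type = "prob")$.pred_yes
manual_class <- factor(if_else(probs > .5, "yes", "no"), levels = c("yes", "no"))
table(manual_class, predict(fit_glmnet, feat_test)$.pred_class)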

Question: How can we improve sensitivity?
Show Answer
We can use a more liberal/lower classification threshold for saying someone has 
heart disease.  

For example, rather than requiring a 50% probability to classify as yes for heart disease, 
we could lower to 20% for the classification threshold

Question: What will the consequences of this change be?
Show Answer
First, the Bayes classifier threshold of 50% produces the highest overall accuracy 
so accuracy will generally (though not always) drop when you shift from 50%.  

If we think about this change as it applies to the columns of our confusion matrix, 
we will now catch more of the yes (fewer false negatives/misses), so sensitivity 
will go up.  This was the goal of the lower threshold.  However, we will also end up 
with more false positives so specificity will drop.  If you consider the rows of 
the confusion matrix, we will have more false positives so the PPV will drop.  
However, we will have fewer false negatives so the NPV will increase.  Whether 
these trade-offs are worth it is a function of the cost of different types of errors 
and how much you gain and lose with regard to each type of performance metric 
(ROC can inform this; more in a moment)
autoplot(cm, type = "heatmap")


Previously, we simply used predict(fit_glmnet, feat_test)$.pred_class.


$.pred_class is dichotomized at 50% by default

  • This is the classification threshold that produces the best overall accuracy from the predicted probabilities (i.e., the Bayes classifier)
  • However, we can use a different threshold to increase sensitivity or specificity
  • This comes at a cost to the other characteristic (it's a trade-off)
    • Lower threshold increases sensitivity but decreases specificity
    • Higher threshold increases specificity but decreases sensitivity

It is relatively easy to make a new confusion matrix and get new performance metrics with a different classification threshold

  • Make a tibble with truth and predicted probabilities
preds <- tibble(truth = feat_test$disease,
                prob = predict(fit_glmnet, feat_test, type = "prob")$.pred_yes)

preds
# A tibble: 428 × 2
   truth    prob
   <fct>   <dbl>
 1 no    0.0178 
 2 no    0.119  
 3 no    0.0204 
 4 no    0.00269
 5 no    0.0105 
 6 no    0.00157
 7 no    0.00574
 8 no    0.0414 
 9 no    0.00897
10 no    0.104  
# ℹ 418 more rows

  • Use this to get class estimates at any threshold we want
  • Here we threshold at 20%
preds <- preds |> 
  mutate(estimate_20 = if_else(prob <= .20, "no", "yes"),
         estimate_20 = factor(estimate_20, levels = c("yes", "no"))) |> 
  print()
# A tibble: 428 × 3
   truth    prob estimate_20
   <fct>   <dbl> <fct>      
 1 no    0.0178  no         
 2 no    0.119   no         
 3 no    0.0204  no         
 4 no    0.00269 no         
 5 no    0.0105  no         
 6 no    0.00157 no         
 7 no    0.00574 no         
 8 no    0.0414  no         
 9 no    0.00897 no         
10 no    0.104   no         
# ℹ 418 more rows

We can now make a confusion matrix for this new set of truth and estimates using the 20% threshold

cm_20 <- preds |> 
  conf_mat(truth = truth, estimate = estimate_20)

cm_20
          Truth
Prediction yes  no
       yes  35  37
       no   12 344

Let’s compare to 50% (original)

cm
          Truth
Prediction yes  no
       yes  30  13
       no   17 368

And let’s compare these two thresholds on a subset of our numeric metrics

  • 20% threshold
cm_20 |> 
  summary() |> 
  filter(.metric %in% c("accuracy", "sens", "spec", "ppv", "npv")) |> 
  select(-.estimator)
# A tibble: 5 × 2
  .metric  .estimate
  <chr>        <dbl>
1 accuracy     0.886
2 sens         0.745
3 spec         0.903
4 ppv          0.486
5 npv          0.966
  • 50% threshold
cm |> 
  summary() |> 
  filter(.metric %in% c("accuracy", "sens", "spec", "ppv", "npv")) |> 
  select(-.estimator)
# A tibble: 5 × 2
  .metric  .estimate
  <chr>        <dbl>
1 accuracy     0.930
2 sens         0.638
3 spec         0.966
4 ppv          0.698
5 npv          0.956

Do the changes on each of these metrics make sense to you? If not, please review these previous slides again!


You can begin to visualize the classifier performance by threshold simply by plotting histograms of the predicted positive class probabilities, separately for the true positive and negative classes

  • Let’s look at our classifier
  • Ideally, the probabilities are mostly low for the true negative class (“no”) and mostly high for the true positive class (“yes”)
  • You can imagine how any specific probability cut point would affect specificity (apply cut to the left panel) or sensitivity (apply cut to the right panel)
ggplot(data = preds, aes(x = prob)) + 
   geom_histogram(bins = 15) +
   facet_wrap(vars(truth), nrow = 2) +
   xlab("Pr(Disease)")


Question: What do you think about its performance? What insights does this plot generate?
Show Answer
1. You can see that we can likely drop the threshold to somewhere about 25% without 
decreasing the specificity too much.  This will allow you to detect more positive cases.  

2.  Our model seems to be able to predict negative cases well.  They mostly have low 
probabilities.   However, you can see its poor performance with positive cases.  
They are spread pretty evenly across the full range of probabilities.  We likely 
do not have enough positive cases in our training data

The Receiver Operating Characteristic (ROC) curve for a classifier provides a more formal method to visualize the trade-off between sensitivity and specificity across all possible classification thresholds.

Let's look at this in our example

  • We need columns for truth and probabilities of the positive class for each observation
  • We need to specify the positive class
  • Returns tibble with data to plot an ROC curve
roc_plot <- 
  tibble(truth = feat_test$disease,
         prob = predict(fit_glmnet, feat_test, type = "prob")$.pred_yes) |> 
  roc_curve(prob, truth = truth)

roc_plot
# A tibble: 202 × 3
    .threshold specificity sensitivity
         <dbl>       <dbl>       <dbl>
 1 -Inf            0                 1
 2    0.000663     0                 1
 3    0.000847     0.00787           1
 4    0.00102      0.0131            1
 5    0.00157      0.0210            1
 6    0.00209      0.0262            1
 7    0.00222      0.0315            1
 8    0.00226      0.0341            1
 9    0.00226      0.0394            1
10    0.00263      0.0446            1
# ℹ 192 more rows

We can autoplot() this

autoplot(roc_plot)


Or we can customize a plot by passing the data into ggplot()

  • Not doing anything fancy here
  • Consider this a shell for you to build on if you want more than autoplot() provides
roc_plot |>
  ggplot(aes(x = 1 - specificity, y = sensitivity)) +
  geom_path() +
  geom_abline(lty = 3) +
  coord_equal() +
  labs(x = "1 - Specificity (FPR)",
       y = "Sensitivity (TPR)")


When evaluating an ROC curve:

  • A random classifier would have a diagonal curve from bottom-left to top-right (the dotted line)

  • A perfect classifier would reach up to the top-left corner

    • Sensitivity = 1 (true positive rate)
    • 1 - Specificity = 0 (false positive rate)

The ROC curve is not only a useful method to visualize classifier performance across thresholds

The area under the ROC curve (auROC) is an attractive performance metric

  • Ranges from 1.0 (perfect) down to approximately 0.5 (random classifier)
    • If the auROC was consistently less than 0.5, then the predictions could simply be inverted
    • Values between .70 and .80 are considered fair
    • Values between .80 and .90 are considered good
    • Values above .90 are considered excellent
    • These are very rough, and to my eye, the exact cuts and labels are somewhat arbitrary

  • auROC is the probability that the classifier will rank/predict a randomly selected true positive observation higher than a randomly selected true negative observation

  • Alternatively, it can be thought of as the average sensitivity across the full range of specificity values (i.e., averaged over all false positive rates)

  • auROC summarizes performance (sensitivity vs. specificity trade-off) across all possible thresholds

  • auROC is not affected by class imbalance, in contrast to many other metrics


It is easy to get the auROC in tidymodels using roc_auc()

As with calculating the ROC curve, we need

  • truth
  • predicted probabilities for the positive class
  • to specify the event_level (default is first)
tibble(truth = feat_test$disease,
       prob = predict(fit_glmnet, feat_test, type = "prob")$.pred_yes) |> 
  roc_auc(prob, truth = truth)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 roc_auc binary         0.898
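
We can also check the probabilistic interpretation of auROC directly. A minimal sketch using the preds tibble from above: the proportion of positive/negative pairs in which the positive case receives the higher predicted probability (counting ties as 1/2) should essentially reproduce the roc_auc() value.

pos <- preds$prob[preds$truth == "yes"]
neg <- preds$prob[preds$truth == "no"]
mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))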

8.5 Using Alternative Performance Metrics for Model Selection

You can select the best model configuration using resampling with performance metrics other than accuracy

  • Aggregate measures are typically your best choice (except potentially with high imbalance - more on this later)

    • For classification:
      • Accuracy
      • Balanced accuracy
      • \(F1\)
      • Area under ROC Curve (auROC)
      • Kappa
    • For regression
      • RMSE
      • \(R^2\)
      • MAE (mean absolute error)

  • You typically need to use a single metric for selection among model configurations
    • You should generally use the performance metric that is best aligned with your problem:

  • In classification
    • Do you care about types of errors or just overall error rate
    • Is the outcome relatively balanced or unbalanced
    • What metric will be clearest to your audience

  • In regression
    • Do you want to weight big and small errors the same
    • What metric will be clearest to your audience (though all of these are pretty clear. There are more complicated regression metrics)

  • Although you will use one metric to select the best configuration, you can evaluate/characterize the performance of your final model with as many metrics as you like

You should recognize the differences between the cost function for the algorithm and the performance metric:

  • Cost function is fundamental to the definition of the algorithm

  • Cost function is minimized to determine parameter estimates in parametric models

  • Performance metric is independent of algorithm

  • Performance metric is used to select and evaluate model configurations

  • Sometimes they can be the same metric (e.g., RMSE)

  • BUT, this is not required
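
For example (a sketch, not part of the workflow above): glmnet estimates its parameters by minimizing a penalized log-loss (its cost function), but we select model configurations with whatever performance metric we choose (accuracy, auROC, etc.), and we could also report log-loss as just another performance metric in test using yardstick's mn_log_loss().

tibble(truth = feat_test$disease,
       prob = predict(fit_glmnet, feat_test, type = "prob")$.pred_yes) |> 
  mn_log_loss(prob, truth = truth)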


With tidymodels, it is easy to select hyperparameters or select among model configurations more generally using one of many different performance metrics

  • We will still use either tune_grid() or fit_resamples()
  • We will simply specify a different performance metric inside of metric_set()
  • If we only measure one performance metric, we can use defaults with show_best()

Here is an example of measuring roc_auc(), but you can use any performance function from the yardstick package

fits_glmnet_auc <- cache_rds(
  expr = {
    logistic_reg(penalty = tune(), 
                 mixture = tune()) |> 
      set_engine("glmnet") |> 
      set_mode("classification") |> 
      tune_grid(preprocessor = rec, 
                resamples = splits_boot, grid = grid_glmnet, 
                metrics = metric_set(roc_auc))

  },
  rerun = rerun_setting,
  dir = "cache/008/",
  file = "fits_glmnet_auc")

And now use show_best(), with best defined by auROC

show_best(fits_glmnet_auc, n = 5)
# A tibble: 5 × 8
  penalty mixture .metric .estimator  mean     n std_err .config                
    <dbl>   <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                  
1  0.147        0 roc_auc binary     0.905   100 0.00222 Preprocessor1_Model0111
2  0.0756       0 roc_auc binary     0.905   100 0.00220 Preprocessor1_Model0099
3  0.155        0 roc_auc binary     0.905   100 0.00222 Preprocessor1_Model0112
4  0.139        0 roc_auc binary     0.905   100 0.00222 Preprocessor1_Model0110
5  0.124        0 roc_auc binary     0.905   100 0.00222 Preprocessor1_Model0108

You can also measure multiple metrics during resampling, but you will need to select the best configuration using only one

  • see metric_set() for the measurement of multiple metrics
  • see the metric argument to show_best() to indicate which metric to use for selection
fits_glmnet_many <- cache_rds(
  expr = {
    logistic_reg(penalty = tune(), 
                 mixture = tune()) |> 
      set_engine("glmnet") |> 
      set_mode("classification") |> 
      tune_grid(preprocessor = rec, 
                resamples = splits_boot, grid = grid_glmnet, 
                metrics = metric_set(roc_auc, accuracy, sens, spec, bal_accuracy))
  },
  rerun = rerun_setting,
  dir = "cache/008/",
  file = "fits_glmnet_many")

But now we need to indicate how to define best

We will define it based on balanced accuracy

show_best(fits_glmnet_many, metric = "bal_accuracy", n = 5)
# A tibble: 5 × 8
   penalty mixture .metric      .estimator  mean     n std_err
     <dbl>   <dbl> <chr>        <chr>      <dbl> <int>   <dbl>
1 0.000335     1   bal_accuracy binary     0.745   100 0.00378
2 0.000355     1   bal_accuracy binary     0.745   100 0.00378
3 0.000375     0.8 bal_accuracy binary     0.745   100 0.00378
4 0.000396     0.8 bal_accuracy binary     0.745   100 0.00378
5 0.000335     0.8 bal_accuracy binary     0.745   100 0.00379
  .config                
  <chr>                  
1 Preprocessor1_Model1001
2 Preprocessor1_Model1002
3 Preprocessor1_Model0803
4 Preprocessor1_Model0804
5 Preprocessor1_Model0801



And here based on auROC (same as before)

show_best(fits_glmnet_many, metric = "roc_auc", n = 5)
# A tibble: 5 × 8
  penalty mixture .metric .estimator  mean     n std_err .config                
    <dbl>   <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                  
1  0.147        0 roc_auc binary     0.905   100 0.00222 Preprocessor1_Model0111
2  0.0756       0 roc_auc binary     0.905   100 0.00220 Preprocessor1_Model0099
3  0.155        0 roc_auc binary     0.905   100 0.00222 Preprocessor1_Model0112
4  0.139        0 roc_auc binary     0.905   100 0.00222 Preprocessor1_Model0110
5  0.124        0 roc_auc binary     0.905   100 0.00222 Preprocessor1_Model0108

Of course, you can easily calculate any performance metric from the yardstick package in test to evaluate your final model

  • Using conf_mat() and summary() as above (but not for auROC)
  • Using *_vec() functions; for example:
    • accuracy_vec()
    • roc_auc_vec()
    • sens_vec(); spec_vec()
  • Piping a tibble that contains truth, and either estimate or probability, into the appropriate function
    • accuracy()
    • roc_auc()
    • sens(); spec()

Fit best model using auROC

fit_glmnet_auc <-   
  logistic_reg(penalty = select_best(fits_glmnet_many, metric = "roc_auc")$penalty, 
               mixture = select_best(fits_glmnet_many, metric = "roc_auc")$mixture) |> 
  set_engine("glmnet") |> 
  set_mode("classification") |> 
  fit(disease ~ ., data = feat_trn)

and get all metrics

cm_auc <- 
  tibble(truth = feat_test$disease,
         estimate = predict(fit_glmnet_auc, feat_test)$.pred_class) |> 
  conf_mat(truth, estimate)

cm_auc |> 
  summary()
# A tibble: 13 × 3
   .metric              .estimator .estimate
   <chr>                <chr>          <dbl>
 1 accuracy             binary        0.928 
 2 kap                  binary        0.479 
 3 sens                 binary        0.340 
 4 spec                 binary        1     
 5 ppv                  binary        1     
 6 npv                  binary        0.925 
 7 mcc                  binary        0.561 
 8 j_index              binary        0.340 
 9 bal_accuracy         binary        0.670 
10 detection_prevalence binary        0.0374
11 precision            binary        1     
12 recall               binary        0.340 
13 f_meas               binary        0.508 

Fit best model using balanced accuracy

fit_glmnet_bal <-   
  logistic_reg(penalty = select_best(fits_glmnet_many, metric = "bal_accuracy")$penalty, 
               mixture = select_best(fits_glmnet_many, metric = "bal_accuracy")$mixture) |> 
  set_engine("glmnet") |> 
  set_mode("classification") |> 
  fit(disease ~ ., data = feat_trn)

and still get all metrics (different because the best model configuration is different)

cm_bal <- 
  tibble(truth = feat_test$disease,
         estimate = predict(fit_glmnet_bal, feat_test)$.pred_class) |> 
  conf_mat(truth, estimate)

cm_bal |> 
  summary()
# A tibble: 13 × 3
   .metric              .estimator .estimate
   <chr>                <chr>          <dbl>
 1 accuracy             binary         0.930
 2 kap                  binary         0.628
 3 sens                 binary         0.638
 4 spec                 binary         0.966
 5 ppv                  binary         0.698
 6 npv                  binary         0.956
 7 mcc                  binary         0.628
 8 j_index              binary         0.604
 9 bal_accuracy         binary         0.802
10 detection_prevalence binary         0.100
11 precision            binary         0.698
12 recall               binary         0.638
13 f_meas               binary         0.667

8.6 Advanced Methods for Class Imbalances

When there is a high degree of class imbalance:

  • It is often difficult to build models that predict the minority class well
  • This will yield low sensitivity if the positive class is the minority class (as in our example)
  • This will yield low specificity if the negative class is the minority class
  • Each of these issues may be a problem depending on the costs of FP and FN

Let’s see this at play again in our model

  • Using the 50% threshold we have low sensitivity but good specificity
autoplot(cm)

cm |> 
  summary() |> 
  filter(.metric == "sens" | .metric == "spec") |> 
  select(-.estimator)
# A tibble: 2 × 2
  .metric .estimate
  <chr>       <dbl>
1 sens        0.638
2 spec        0.966

  • We do much better with probabilities for negative (majority) vs. positive (minority) class
  • We can see that we will not affect specificity much by lowering the threshold for positive classification to 20-25%
  • BUT, we really need to do better with the distribution of probabilities for observations that are positive (yes)
ggplot(data = preds, aes(x = prob)) + 
   geom_histogram(bins = 15) +
   facet_wrap(vars(truth), nrow = 2) +
   xlab("Pr(Disease)")


What can we do when we have imbalanced outcomes?

We will consider:

  • Changes to the classification/decision threshold that trade-off sensitivity vs. specificity for a fitted model (already discussed)
  • Changes to performance metric for selecting the best model configuration (already demonstrated)
  • Sampling/Resampling methods that will affect the balance of the outcome in the training data to fit models that are better with the minority class (new)

8.6.1 Classification (Decision) Threshold

We have already seen an example of how the classification threshold (the probability at which we split between predicting a case as positive vs. negative) affects sensitivity vs. specificity

Decreasing the threshold (probability) for classifying a case as positive will:

  • Increase sensitivity and decrease specificity
  • This will decrease FN but increase FP
  • This may be useful if the positive class is the minority class
  • The ROC curve is a useful display to provide this information about your classifier
    • Curve can be colored to show the threshold
  • The separate histograms by positive vs. negative class can also be useful
  • If you want to use your data to select the best threshold, you will need yet another set of data to make this selection (see the sketch after this list)
    • Can’t make the selection in training b/c those probabilities are overfit
    • Can’t make the selection of threshold in test and then also use the same test data to evaluate that model!
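
Here is a minimal sketch of the mechanics of threshold selection, assuming you had a separate validation set: compute the ROC curve in validation and pick the threshold that maximizes Youden's J (sensitivity + specificity - 1). Note that roc_plot below was computed in test, so it is used here only to illustrate the code, not as a recommended workflow.

roc_plot |> 
  mutate(j = sensitivity + specificity - 1) |> 
  slice_max(j, n = 1)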

8.6.2 Performance Metric Considerations

When you are choosing a performance metric for selecting your best model configuration, you should choose one that is best aligned with the nature of the performance you seek

  • If you want just good overall accuracy, accuracy may be a good metric
  • If the outcome is unbalanced, and you care about the types of errors, you might want
    • Balanced accuracy (average of sensitivity and specificity)
    • Only sensitivity or specificity by itself (recommended by Kuhn)
    • auROC
    • An F measure (harmonic mean of sensitivity and PPV)
    • May need to think carefully about what is most important to you

Earlier, we saw that we got better sensitivity when we used balanced accuracy rather than accuracy to tune our hyperparameters
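
As a quick check of that definition, using the numbers reported above for the 50% threshold (sensitivity = 0.638, specificity = 0.966):

(0.638 + 0.966) / 2
[1] 0.802

This matches the bal_accuracy value in the test metrics for that model configuration.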


8.6.3 Sampling and Resampling to Address Class Imbalance

We can address issues of class imbalance either a priori or post-hoc with respect to data collection

  • The a priori method would be to over-sample during data collection to get more of the minority class into your training set

  • Use targeted recruiting

  • Can be very costly or impossible in many instances

  • If possible, this can be much better than the resampling approach below

  • Post hoc, we can employ a variety of resampling procedures that are designed to make the training data more balanced

    • We can up-sample the minority class
    • We can down-sample the majority class
    • We can synthesize new minority class observations (e.g., SMOTE)
  • For both a priori sampling and post-hoc resampling strategies, it is important that your test set is not manipulated. It should represent the expected distribution for the outcome, unaltered


8.6.4 Up-sampling

  • We resample minority class observations with replacement within our training set to increase the number of total observations of the minority class in the training set.
  • This simply duplicates existing minority class observations
  • Our test (or validation) set(s) should NOT be resampled. This is handled well by step_upsample()
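
Conceptually (a sketch only, not the workflow we will use), up-sampling with over_ratio = 1 amounts to duplicating minority rows with replacement until the two classes match. In practice, we let step_upsample() do this inside the recipe so that held-out data are never resampled.

# illustration only: manual up-sampling of the training data
minority <- data_trn |> filter(disease == "yes")
majority <- data_trn |> filter(disease == "no")
bind_rows(majority,
          slice_sample(minority, n = nrow(majority), replace = TRUE)) |> 
  tab(disease)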

Let’s apply this in our example

  • Up-sampling is part of feature engineering recipe
  • Need to specify the outcome (disease)
  • Can set over_ratio to values other than 1 if desired
  • Makes sense to do this after missing data imputation and dummy coding
  • Makes sense to do this before normalizing features
  • These steps are in the themis package rather than recipes (can use namespace or load full library)
rec_up <- recipe(disease ~ ., data = data_trn) |> 
  step_impute_median(all_numeric_predictors()) |> 
  step_impute_mode(all_nominal_predictors()) |>   
  step_dummy(all_nominal_predictors()) |> 
  themis::step_upsample(disease, over_ratio = 1) |> 
  step_normalize(all_numeric_predictors())

Need to re-tune model b/c the sample size has changed

Need to refit the final model to all of train

fits_glmnet_up <- cache_rds(
  expr = {
    logistic_reg(penalty = tune(), 
                 mixture = tune()) |> 
      set_engine("glmnet") |> 
      set_mode("classification") |> 
      tune_grid(preprocessor = rec_up, 
                resamples = splits_boot, grid = grid_glmnet, 
                metrics = metric_set(bal_accuracy))
  },
  rerun = rerun_setting,
  dir = "cache/008/",
  file = "fit_glmnet_up")

Review hyperparameter plot

plot_hyperparameters(fits_glmnet_up, hp1 = "penalty", hp2 = "mixture", metric = "bal_accuracy", log_hp1 = TRUE)


Let’s fit this best model configuration to all of our training data and evaluate it in test

  • Note the use of NULL for new_data
  • step_upsample() is only applied to training/held-in but not held-out (truly new) data
  • bake() knows to use the training data provided during prep() if we specify NULL
  • Could have done this all along when baking the training data. It's sometimes faster, but previously it gave the same result. Now it's necessary!
rec_up_prep <- rec_up |> 
  prep(data_trn)

feat_trn_up <- rec_up_prep |> 
  bake(new_data = NULL)

Notice that disease is now balanced in the training data!

feat_trn_up |> 
  skim_all()
Data summary
Name feat_trn_up
Number of rows 1522
Number of columns 18
_______________________
Column type frequency:
factor 1
numeric 17
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate n_unique top_counts
disease 0 1 2 yes: 761, no: 761

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 skew kurtosis
age 0 1 0 1 -2.82 -0.70 0.08 0.75 2.42 -0.31 -0.53
rest_bp 0 1 0 1 -2.19 -0.73 -0.17 0.40 3.32 0.55 0.40
chol 0 1 0 1 -2.36 -0.72 -0.08 0.64 6.11 1.03 3.51
exer_max_hr 0 1 0 1 -2.63 -0.74 0.11 0.71 2.25 -0.49 -0.23
exer_st_depress 0 1 0 1 -0.94 -0.94 -0.23 0.49 4.05 1.07 0.84
ca 0 1 0 1 -0.74 -0.74 -0.74 0.35 2.53 1.16 0.25
sex_male 0 1 0 1 -1.49 -1.49 0.67 0.67 0.67 -0.82 -1.33
cp_atyp_ang 0 1 0 1 -0.45 -0.45 -0.45 -0.45 2.24 1.79 1.21
cp_non_anginal 0 1 0 1 -1.76 0.57 0.57 0.57 0.57 -1.20 -0.57
fbs_yes 0 1 0 1 -0.42 -0.42 -0.42 -0.42 2.40 1.98 1.93
rest_ecg_wave_abn 0 1 0 1 -0.12 -0.12 -0.12 -0.12 8.45 8.33 67.40
rest_ecg_ventric_hypertrophy 0 1 0 1 -0.96 -0.96 -0.96 1.04 1.04 0.08 -1.99
exer_ang_yes 0 1 0 1 -0.68 -0.68 -0.68 1.48 1.48 0.80 -1.37
exer_st_slope_flat 0 1 0 1 -0.97 -0.97 -0.97 1.03 1.03 0.05 -2.00
exer_st_slope_downslope 0 1 0 1 -0.23 -0.23 -0.23 -0.23 4.36 4.13 15.06
thal_fixeddefect 0 1 0 1 -0.20 -0.20 -0.20 -0.20 5.02 4.82 21.25
thal_reversabledefect 0 1 0 1 -0.82 -0.82 -0.82 1.22 1.22 0.40 -1.84

fit_glmnet_up <-   
  logistic_reg(penalty = select_best(fits_glmnet_up)$penalty, 
               mixture = select_best(fits_glmnet_up)$mixture) |> 
  set_engine("glmnet") |> 
  set_mode("classification") |> 
  fit(disease ~ ., data = feat_trn_up)

To evaluate this model, we now need test features too

  • IMPORTANT: Test is NOT up-sampled - bake it as new data!
feat_test <- rec_up_prep |> 
  bake(data_test)

feat_test |> skim_all()
Data summary
Name feat_test
Number of rows 428
Number of columns 18
_______________________
Column type frequency:
factor 1
numeric 17
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate n_unique top_counts
disease 0 1 2 no: 381, yes: 47

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 skew kurtosis
age 0 1 -0.07 1.06 -2.26 -1.03 -0.14 0.75 2.53 0.14 -0.75
rest_bp 0 1 -0.20 0.91 -2.19 -0.73 -0.17 0.40 3.78 0.69 1.13
chol 0 1 -0.10 1.02 -2.36 -0.76 -0.27 0.37 6.11 1.99 8.98
exer_max_hr 0 1 0.22 0.89 -3.35 -0.23 0.37 0.88 1.91 -0.73 0.32
exer_st_depress 0 1 -0.26 0.89 -0.94 -0.94 -0.58 0.15 4.59 1.69 3.29
ca 0 1 -0.33 0.82 -0.74 -0.74 -0.74 -0.74 2.53 2.05 3.44
sex_male 0 1 -0.20 1.06 -1.49 -1.49 0.67 0.67 0.67 -0.39 -1.85
cp_atyp_ang 0 1 0.09 1.08 -0.45 -0.45 -0.45 -0.45 2.24 1.49 0.21
cp_non_anginal 0 1 -0.11 1.06 -1.76 -1.76 0.57 0.57 0.57 -0.92 -1.15
fbs_yes 0 1 -0.05 0.95 -0.42 -0.42 -0.42 -0.42 2.40 2.18 2.77
rest_ecg_wave_abn 0 1 0.00 1.01 -0.12 -0.12 -0.12 -0.12 8.45 8.24 66.02
rest_ecg_ventric_hypertrophy 0 1 -0.08 0.99 -0.96 -0.96 -0.96 1.04 1.04 0.23 -1.95
exer_ang_yes 0 1 -0.22 0.88 -0.68 -0.68 -0.68 -0.68 1.48 1.40 -0.04
exer_st_slope_flat 0 1 -0.23 0.97 -0.97 -0.97 -0.97 1.03 1.03 0.53 -1.72
exer_st_slope_downslope 0 1 0.10 1.19 -0.23 -0.23 -0.23 -0.23 4.36 3.29 8.83
thal_fixeddefect 0 1 0.01 1.02 -0.20 -0.20 -0.20 -0.20 5.02 4.70 20.11
thal_reversabledefect 0 1 -0.32 0.88 -0.82 -0.82 -0.82 -0.82 1.22 1.17 -0.64

Let’s see how this model performs in test

cm_up <- 
  tibble(truth = feat_test$disease,
         estimate = predict(fit_glmnet_up, feat_test)$.pred_class) |> 
  conf_mat(truth, estimate)

cm_up |> 
  summary()
# A tibble: 13 × 3
   .metric              .estimator .estimate
   <chr>                <chr>          <dbl>
 1 accuracy             binary         0.808
 2 kap                  binary         0.393
 3 sens                 binary         0.830
 4 spec                 binary         0.806
 5 ppv                  binary         0.345
 6 npv                  binary         0.975
 7 mcc                  binary         0.451
 8 j_index              binary         0.636
 9 bal_accuracy         binary         0.818
10 detection_prevalence binary         0.264
11 precision            binary         0.345
12 recall               binary         0.830
13 f_meas               binary         0.487

preds_up <- tibble(truth = feat_test$disease,
                prob = predict(fit_glmnet_up, feat_test, type = "prob")$.pred_yes)

ggplot(data = preds_up, aes(x = prob)) + 
   geom_histogram(bins = 15) +
   facet_wrap(vars(truth), nrow = 2) +
   xlab("Pr(Disease)")


8.6.5 Down-sampling

We resample (subset) majority class observations within our training set to decrease their total number to match the number of minority class observations in the training set.

  • This selects a subset of the majority class
  • Our test (or validation) set(s) should NOT be resampled. This is handled well by step_downsample()

Down-sampling is part of feature engineering recipe

  • Need to specify the outcome (disease)
  • Can set under_ratio to values other than 1 if desired
  • Makes sense to do this after missing data imputation and dummy coding
  • Makes sense to do this before normalizing features
rec_down <- recipe(disease ~ ., data = data_trn) |> 
  step_impute_median(all_numeric_predictors()) |> 
  step_impute_mode(all_nominal_predictors()) |>   
  step_dummy(all_nominal_predictors()) |> 
  themis::step_downsample(disease, under_ratio = 1) |> 
  step_normalize(all_numeric_predictors())

Need to re-tune model b/c the sample size has changed

Need to refit the final model to all of train

fits_glmnet_down <- cache_rds(
  expr = {
    logistic_reg(penalty = tune(), 
                 mixture = tune()) |> 
      set_engine("glmnet") |> 
      set_mode("classification") |> 
      tune_grid(preprocessor = rec_down, 
                resamples = splits_boot, grid = grid_glmnet, 
                metrics = metric_set(bal_accuracy))
  },
  rerun = rerun_setting,
  dir = "cache/008/",
  file = "fits_glmnet_down")

Review hyperparameters

plot_hyperparameters(fits_glmnet_down, hp1 = "penalty", hp2 = "mixture", 
                     metric = "bal_accuracy", log_hp1 = TRUE)


Let’s fit this best model configuration to all of our training data and evaluate it in test

  • Note the use of NULL again when getting features for training data. Very important!
  • step_downsample() is not applied to baked data (see its default skip = TRUE argument)
  • NOTE: sample size and ratio of yes/no for disease!
rec_down_prep <- rec_down |> 
  prep(data_trn)

feat_trn_down <- rec_down_prep |> 
  bake(new_data = NULL)

feat_trn_down |> skim_some()
Data summary
Name feat_trn_down
Number of rows 184
Number of columns 18
_______________________
Column type frequency:
factor 1
numeric 17
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
disease 0 1 FALSE 2 yes: 92, no: 92

Variable type: numeric

skim_variable n_missing complete_rate p0 p100
age 0 1 -2.85 2.16
rest_bp 0 1 -1.89 3.50
chol 0 1 -2.52 3.34
exer_max_hr 0 1 -2.56 2.16
exer_st_depress 0 1 -0.87 4.24
ca 0 1 -0.74 2.46
sex_male 0 1 -1.55 0.64
cp_atyp_ang 0 1 -0.47 2.09
cp_non_anginal 0 1 -1.78 0.56
fbs_yes 0 1 -0.47 2.09
rest_ecg_wave_abn 0 1 -0.10 9.51
rest_ecg_ventric_hypertrophy 0 1 -0.99 1.01
exer_ang_yes 0 1 -0.64 1.55
exer_st_slope_flat 0 1 -0.84 1.19
exer_st_slope_downslope 0 1 -0.21 4.68
thal_fixeddefect 0 1 -0.17 5.97
thal_reversabledefect 0 1 -0.79 1.26

Now fit the model to these downsampled training data

fit_glmnet_down <-   
  logistic_reg(penalty = select_best(fits_glmnet_down)$penalty, 
               mixture = select_best(fits_glmnet_down)$mixture) |> 
  set_engine("glmnet") |> 
  set_mode("classification") |> 
  fit(disease ~ ., data = feat_trn_down)

Let’s see how this model performs in test

  • First we need test features
  • IMPORTANT: Test is NOT down-sampled
feat_test <- rec_down_prep |> 
  bake(data_test)

feat_test |> skim_some()
Data summary
Name feat_test
Number of rows 428
Number of columns 18
_______________________
Column type frequency:
factor 1
numeric 17
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
disease 0 1 FALSE 2 no: 381, yes: 47

Variable type: numeric

skim_variable n_missing complete_rate p0 p100
age 0 1 -2.29 2.49
rest_bp 0 1 -2.25 3.97
chol 0 1 -2.52 6.30
exer_max_hr 0 1 -3.27 1.83
exer_st_depress 0 1 -0.87 4.78
ca 0 1 -0.74 2.46
sex_male 0 1 -1.55 0.64
cp_atyp_ang 0 1 -0.47 2.09
cp_non_anginal 0 1 -1.78 0.56
fbs_yes 0 1 -0.47 2.09
rest_ecg_wave_abn 0 1 -0.10 9.51
rest_ecg_ventric_hypertrophy 0 1 -0.99 1.01
exer_ang_yes 0 1 -0.64 1.55
exer_st_slope_flat 0 1 -0.84 1.19
exer_st_slope_downslope 0 1 -0.21 4.68
thal_fixeddefect 0 1 -0.17 5.97
thal_reversabledefect 0 1 -0.79 1.26

Get metrics in test

cm_down <- 
  tibble(truth = feat_test$disease,
         estimate = predict(fit_glmnet_down, feat_test)$.pred_class) |> 
  conf_mat(truth, estimate)

cm_down |> 
  summary()
# A tibble: 13 × 3
   .metric              .estimator .estimate
   <chr>                <chr>          <dbl>
 1 accuracy             binary         0.780
 2 kap                  binary         0.357
 3 sens                 binary         0.851
 4 spec                 binary         0.772
 5 ppv                  binary         0.315
 6 npv                  binary         0.977
 7 mcc                  binary         0.426
 8 j_index              binary         0.623
 9 bal_accuracy         binary         0.811
10 detection_prevalence binary         0.297
11 precision            binary         0.315
12 recall               binary         0.851
13 f_meas               binary         0.460

And plot faceted probabilities

preds_down <- tibble(truth = feat_test$disease,
                prob = predict(fit_glmnet_down, feat_test, type = "prob")$.pred_yes)

ggplot(data = preds_down, aes(x = prob)) + 
   geom_histogram(bins = 15) +
   facet_wrap(vars(truth), nrow = 2) +
   xlab("Pr(Disease)")


8.6.6 SMOTE

A third approach to resampling is called the synthetic minority over-sampling technique (SMOTE)

To up-sample the minority class, SMOTE synthesizes new observations.

  • To do this, an observation is randomly selected from the minority class.
  • This observation’s K-nearest neighbors (KNNs) are then determined.
  • The new synthetic observation retains the minority class outcome, but its predictor values are a random combination of the predictor values of the selected observation and its neighbors.
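
A conceptual sketch of how a single synthetic observation is formed (step_smote() does this internally; the numeric values here are hypothetical):

x_i  <- c(age = 54, chol = 230)          # a randomly selected minority class observation
x_nn <- c(age = 61, chol = 250)          # one of its K nearest minority class neighbors
x_i + runif(1) * (x_nn - x_i)            # synthetic observation on the segment between them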

This is easily implemented in a tidymodels recipe using step_smote()

  • Need to specify the outcome (disease)
  • Can set over_ratio to values other than 1 (default) if desired
  • Can set neighbors to values other than 5 (default) if desired
  • Makes sense to do this after missing data imputation and dummy coding
  • Features need to be scaled/range-corrected first because SMOTE uses distances to find neighbors (hence step_range() below)
  • Makes sense to do this before normalizing features for glmnet, etc
rec_smote <- recipe(disease ~ ., data = data_trn) |> 
  step_impute_median(all_numeric_predictors()) |> 
  step_impute_mode(all_nominal_predictors()) |>   
  step_dummy(all_nominal_predictors()) |>
  step_range(all_predictors()) |> 
  themis::step_smote(disease, over_ratio = 1, neighbors = 5) |> 
  step_normalize(all_numeric_predictors())

Need to re-tune model b/c the sample size has changed

Need to refit the final model to all of train

fits_glmnet_smote <- cache_rds(
  expr = {
    logistic_reg(penalty = tune(), 
                 mixture = tune()) |> 
      set_engine("glmnet") |> 
      set_mode("classification") |> 
      tune_grid(preprocessor = rec_smote, 
                resamples = splits_boot, grid = grid_glmnet, 
                metrics = metric_set(bal_accuracy))
  },
  rerun = rerun_setting,
  dir = "cache/008/",
  file = "fits_glmnet_smote")

Review hyperparameters

plot_hyperparameters(fits_glmnet_smote, hp1 = "penalty", hp2 = "mixture", 
                     metric = "bal_accuracy", log_hp1 = TRUE)


Let’s fit this best model configuration to all of our training data and evaluate it in test

  • Note the use of NULL (once again!) for new_data
  • step_smote() is not applied to new data
  • NOTE: sample size and outcome balance
rec_smote_prep <- rec_smote |> 
  prep(data_trn)

feat_trn_smote <- rec_smote_prep |> 
  bake(NULL)

feat_trn_smote |> skim_some()
Data summary
Name feat_trn_smote
Number of rows 1522
Number of columns 18
_______________________
Column type frequency:
factor 1
numeric 17
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
disease 0 1 FALSE 2 yes: 761, no: 761

Variable type: numeric

skim_variable n_missing complete_rate p0 p100
age 0 1 -2.97 2.57
rest_bp 0 1 -2.38 3.72
chol 0 1 -2.51 6.44
exer_max_hr 0 1 -2.84 2.42
exer_st_depress 0 1 -1.00 4.71
ca 0 1 -0.78 2.73
sex_male 0 1 -1.60 0.64
cp_atyp_ang 0 1 -0.45 2.29
cp_non_anginal 0 1 -1.81 0.57
fbs_yes 0 1 -0.42 2.52
rest_ecg_wave_abn 0 1 -0.11 11.37
rest_ecg_ventric_hypertrophy 0 1 -0.98 1.05
exer_ang_yes 0 1 -0.68 1.50
exer_st_slope_flat 0 1 -0.97 1.06
exer_st_slope_downslope 0 1 -0.24 4.71
thal_fixeddefect 0 1 -0.19 5.75
thal_reversabledefect 0 1 -0.89 1.18

Fit model to the SMOTE-sampled training data

fit_glmnet_smote <-   
  logistic_reg(penalty = select_best(fits_glmnet_smote)$penalty, 
               mixture = select_best(fits_glmnet_smote)$mixture) |> 
  set_engine("glmnet") |> 
  set_mode("classification") |> 
  fit(disease ~ ., data = feat_trn_smote)

Let’s see how this model performs in test

  • IMPORTANT: Test is NOT SMOTE up-sampled
feat_test <- rec_smote_prep |> 
  bake(data_test)

cm_smote <- 
  tibble(truth = feat_test$disease,
         estimate = predict(fit_glmnet_smote, feat_test)$.pred_class) |> 
  conf_mat(truth, estimate)

cm_smote |> 
  summary()
# A tibble: 13 × 3
   .metric              .estimator .estimate
   <chr>                <chr>          <dbl>
 1 accuracy             binary         0.811
 2 kap                  binary         0.390
 3 sens                 binary         0.809
 4 spec                 binary         0.811
 5 ppv                  binary         0.345
 6 npv                  binary         0.972
 7 mcc                  binary         0.443
 8 j_index              binary         0.620
 9 bal_accuracy         binary         0.810
10 detection_prevalence binary         0.257
11 precision            binary         0.345
12 recall               binary         0.809
13 f_meas               binary         0.484

Plot faceted predicted probabilities

preds_smote <- tibble(truth = feat_test$disease,
                prob = predict(fit_glmnet_smote, feat_test, type = "prob")$.pred_yes)

ggplot(data = preds_smote, aes(x = prob)) + 
   geom_histogram(bins = 15) +
   facet_wrap(vars(truth), nrow = 2) +
   xlab("Pr(Disease)")


8.7 Discussion

8.7.1 Announcements

  • Exams by the end of the weekend
  • Conceptual exam often curved a bit
  • Lectures released tonight and by end of tomorrow
  • Installing keras package - stay tuned

8.7.2 General

  • I’m not understanding the difference between tuning the model to create a fit object, and then creating a separate fit object to evaluate it using the test data. I know this is fundamental but it somehow has slipped through my understanding.

    • What is “tuning”
    • How do we tune hyperparameters and then evaluate the best model config? Using validation + test. Using 10-fold + test

8.7.3 Confusion matrix

  • NYTimes article
    • new blood test for colon cancer
    • 53,000 die per year (more details)
    • test results: detected 87%, FPR was 10% (see the PPV sketch after this list)
    • How well does it work for 55-59 year olds?
    • What if applied to a 30-year-old?
  • More generally
    • define accuracy, sens, spec, ppv, npv, bal_accuracy
    • Cost of FP and FN?
    • impact of decision threshold?
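
A worked sketch for the PPV question above, using Bayes' rule with the reported sensitivity (.87) and FPR (.10); the prevalence values are hypothetical and chosen only to show how quickly PPV falls as prevalence drops.

ppv_calc <- function(sens, fpr, prevalence) {
  (sens * prevalence) / (sens * prevalence + fpr * (1 - prevalence))
}
ppv_calc(.87, .10, prevalence = .003)    # ~ .026 at a hypothetical 0.3% prevalence
ppv_calc(.87, .10, prevalence = .0001)   # ~ .001 for a hypothetical very low-risk 30-year-old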

8.7.4 ROC

  • Axes and various names/representations
  • Why is top right perfect performance
  • thresholds
  • interpretation of auROC and the diagonal line
  • Can we go over some more concrete (real-world) examples of when it would be a good idea to use a different threshold for classification?
    • I would like to understand more about adjusting decision thresholds across a few more application contexts. Also, are there any techniques to optimize decision thresholds? Or is it trial and error.
library(tidyverse)
theme_set(theme_classic()) 
devtools::source_url("https://github.com/jjcurtin/lab_support/blob/main/format_path.R?raw=true")
ℹ SHA-1 hash of file is "a58e57da996d1b70bb9a5b58241325d6fd78890f"
path_models <- format_path("studydata/risk/models/ema")
preds_hour <- read_rds(file.path(path_models, 
                                 "outer_preds_1hour_0_v5_nested_main.rds"))
roc_hour <- preds_hour |> 
  yardstick::roc_curve(prob_beta, truth = label)

Option 1 - TPR vs. FPR

  • sensible axes
  • Less common terms
roc_hour |> 
  ggplot(aes(x = 1 - specificity, y = sensitivity)) +
    geom_path(linewidth = 1.25) +
    geom_abline(lty = 3) +
    coord_fixed(xlim = c(0, 1), ylim = c(0, 1)) +
    labs(x = "False Positive Rate (FPR)",
        y = "True Positive Rate (TPR)") +
  scale_x_continuous(breaks = seq(0,1,.25))

Option 2 - Sensitivity vs. 1 - Specificity

  • More common terms
  • 1 - Specificity is confusing!
roc_hour |> 
  ggplot(aes(x = 1 - specificity, y = sensitivity)) +
    geom_path(linewidth = 1.25) +
    geom_abline(lty = 3) +
    coord_fixed(xlim = c(0, 1), ylim = c(0, 1)) +
    labs(x = "(1 - Specificity)",
        y = "Sensitivity") +
  scale_x_continuous(breaks = seq(0,1,.25))

Option 3

  • More common terms
  • Sensible (though reversed) axis
  • My preferred format
roc_hour |> 
  ggplot(aes(x = 1 - specificity, y = sensitivity)) +
    geom_path(linewidth = 1.25) +
    geom_abline(lty = 3) +
    coord_fixed(xlim = c(0, 1), ylim = c(0, 1)) +
    labs(x = "Specificity",
        y = "Sensitivity") +
  scale_x_continuous(breaks = seq(0,1,.25),
    labels = sprintf("%.2f", seq(1,0,-.25)))


8.7.5 Resampling

  • similarities and differences
  • resampling held-in but not held-out
  • Is there a limit on how much you should up/down-sample or SMOTE? In other words, is there a threshold of imbalance at which it’s not helpful to use one of those techniques, since you are just putting in “fake” data? (Or is “fake data” a bad/incorrect way to think about it?)

8.7.6 Kappa

  • expected accuracy
    • given majority class proportion
    • by chance given base rates
  • ratio of improvement above expected accuracy
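
A quick worked check of kappa (a sketch using the 50%-threshold confusion matrix cm from above: TP = 30, FP = 13, FN = 17, TN = 368; n = 428):

p_o <- (30 + 368) / 428                  # observed accuracy
p_e <- (43 / 428) * (47 / 428) +         # chance agreement on "yes" (row x column marginals)
  (385 / 428) * (381 / 428)              # chance agreement on "no"
(p_o - p_e) / (1 - p_e)                  # ~ 0.63; matches the kap value reported in test above
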
Kuhn, Max, and Kjell Johnson. 2018. Applied Predictive Modeling. 1st ed. 2013, Corr. 2nd printing 2018 edition. New York: Springer.
Landis, J. R., and G. G. Koch. 1977. “The Measurement of Observer Agreement for Categorical Data.” Biometrics 33 (1): 159–74.