Post questions to application-assignments Slack channel
Submit the application assignment here and complete the unit quiz by 8 pm on Wednesday, March 13th
8.2 Introduction
In this unit, we will again use the Cleveland heart disease dataset.
However, I have modified this to make the outcome unbalanced such that Yes represents approximately 10% of the observations
Now that we will calculate performance metrics beyond accuracy, the order of the levels of our outcome variable (disease) matters. We will make sure that the positive class (event of interest; in this case yes for disease) is the first level.
These are not trivial advantages in research or application
However, accuracy has at least three problems in some situations
If the outcome is unbalanced, it can be misleading
High performance from simply predicting the majority (vs. minority) class for all observations
Need to anchor evaluation of accuracy to baseline performance based on the majority case percentage
If the outcome is unbalanced, selecting among model configurations with accuracy can be biased toward configurations that predict the majority class, because that alone will yield high accuracy even without any signal from the predictors
Regardless of outcome distribution, it considers false positives (FP) and false negatives (FN) equivalent in their costs
This is often not the case
Outcome distributions:
May start to be considered unbalanced at ratios of 1:5 (20% of cases in the infrequent class)
In many real life applications (e.g., fraud detection), imbalance ratios ranging from 1:1000 up to 1:5000 are not atypical
When working with unbalanced datasets:
The class or classes with many observations are called the major or majority class(es)
The class with few observations (and there is typically just one) is called the minor or minority class.
In our example, the majority class is negative (no) for heart disease and the minority class is positive (yes) for heart disease
Question: In our example, our model’s accuracy in test seemed high, but was it really performing as well as it seemed?
Show Answer
Although this test accuracy seems high, this is somewhat misleading. A model that simply labeled everyone as negative for heart disease would achieve almost as high accuracy in our test data
# A tibble: 2 × 3
disease n prop
<fct> <int> <dbl>
1 yes 47 0.110
2 no 381 0.890
Question: Perhaps more importantly, are the costs of false positives (screening someone as positive when they do not have heart disease) and false negatives (screening someone as negative when they do have heart disease) comparable for a preliminary screening method?
Show Answer
Probably not. A false positive might mean that we do more testing that is unnecessary and later find out they do not have heart disease. This comes with some monetary cost and also likely some distress for the patient. However, a false negative means we send the patient home thinking they are healthy and may suffer a heart attack or other bad outcome. That seems worse if this is only a preliminary screen.
8.3 The Confusion Matrix and Related Performance Metrics
At a minimum, it seems important to consider these issues explicitly, but accuracy alone is not sufficiently informative.
The first step to this more careful assessment is to construct a confusion matrix
             Ground Truth
Prediction   Positive   Negative
  Positive   TP         FP
  Negative   FN         TN
Definitions:
TP: True positive
TN: True negative
FP: False positive (Type 1 error/false alarm)
FN: False negative (Type 2 error/miss)
A perfect classifier has all the observations on the diagonal from top-left to bottom-right
The two types of errors (on the other diagonal) may have different costs
We can now begin to consider these costs
Let’s look at the confusion matrix associated with our model’s performance in test
We will use conf_mat() to calculate the confusion matrix
There does NOT (yet?) seem to be a vector version (i.e., conf_mat_vec())
Therefore, we have to build a tibble of truth and estimate to pass into conf_mat()
It is best to assign the result to an object (e.g., cm) because we will use it for a few different tasks
cm <- tibble(truth = feat_test$disease,
             estimate = predict(fit_glmnet, feat_test)$.pred_class) |>
  conf_mat(truth, estimate)
Let’s display the matrix
The columns are sorted based on the true value for the observation (i.e., ground truth)
In this case, that is the patients’ true disease status
We can see it is unbalanced with most of the cases in the “no” column
The rows are sorted by our model’s predictions
As we noted above, correct predictions fall on the top-left to bottom-right diagonal
cm
Truth
Prediction yes no
yes 30 13
no 17 368
Tidymodels makes it easy to visualize this matrix in one of two types of plots
mosaic (the default)
autoplot(cm)
heatmap
autoplot(cm, type = "heatmap")
Regardless of the plot, you can now begin to see the issues with our model
It seems accurate for patients that do NOT have heart disease
368/381 correct
It is not very accurate for patients that DO have heart disease
30/47 correct
This differential performance was masked by our global accuracy measure because overall accuracy was weighted heavily toward accuracy for patients without heart disease given their much higher numbers
We can use this confusion matrix as the starting point for MANY other common metrics and methods for evaluating the performance of a classification model. In many instances, the metrics come in pairs that are relevant for FP and FN errors
Sensitivity & Specificity
Positive and Negative Predictive Value (PPV, NPV)
Precision and Recall
There are also some single metric approaches (like accuracy) that may be preferred when the outcome is unbalanced
Balanced accuracy
F1 (and other F-scores)
Kappa
There are also graphical approaches based on either sensitivity/specificity or precision/recall. These are:
The Receiver Operating Characteristic (ROC) curve
The Precision/Recall Curve (not covered further in this unit)
Each of these curves also yields a single metric that represents the area under the curve
The best metric/method is a function both of your intended use and the class distributions
As noted, accuracy is widely understood but can be misleading when class distributions are highly unbalanced
Sensitivity/Specificity are common in literatures that consider diagnosis (clinical psychology/psychiatry, medicine)
Positive/Negative predictive value are key to consider with sensitivity/specificity when classes are unbalanced
ROC curve and its auROC metric provide nice summary visualization and metric when considering classification thresholds other than 50% (also common when classes are unbalanced or types of errors matter)
Here are definitions of many of the most common metrics linked to the confusion matrix
\(F_\beta\) was derived so that it measures the effectiveness of a classifier for someone who assigns \(\beta\) times as much importance to recall as precision
It is easy to get any of these performance metrics using summary() on the confusion matrix
Many of the statistics generated are based on an understanding of which level is the positive level.
Tidymodels (yardstick to be precise) will default to consider the first level the positive level.
If this is not true, some statistics (e.g., sensitivity, specificity) will be incorrect (i.e., swapped).
You can override this default by setting the following parameter inside any function that is affected by the order of the classes: event_level = "second"
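For example (a minimal sketch, assuming the cm object created above and that summary() accepts event_level like the other yardstick functions do):

# all confusion-matrix-based metrics, treating the first level ("yes") as positive
cm |> summary()

# if the positive class had instead been the second factor level
cm |> summary(event_level = "second")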
Let’s consider some of what these metrics are telling us about our classifier by looking at the metrics and a confusion matrix plot
Let’s start with sensitivity and specificity and their arithmetic mean (balanced accuracy)
These are column specific accuracies
Focus is on truth (columns)
Focuses a priori on two types of patients that may walk into the clinic to use our classifier
cm |>
  summary() |>
  filter(.metric == "sens" | .metric == "spec" | .metric == "bal_accuracy") |>
  select(-.estimator)
# A tibble: 3 × 2
.metric .estimate
<chr> <dbl>
1 sens 0.638
2 spec 0.966
3 bal_accuracy 0.802
autoplot(cm, type = "heatmap")
Question: Can you link the numbers in the confusion matrix on the previous slide to the Sensitivity and Specificity metrics?
Show Answer
Sensitivity is "accuracy" for the positive cases (in this instance, those with disease = yes): Sensitivity = 30 / (30 + 17). Specificity is "accuracy" for the negative cases (disease = no): Specificity = 368 / (368 + 13).
Now let’s consider positive predictive value and negative predictive value
These are row specific accuracies
Focus is on model predictions (rows)
Focuses on the utility of the information/screening result provided from our classifier
Not typically reported alone but instead in combo with sensitivity/specificity and prevalence (see next pages)
The mosaic plot is a better visualization for sensitivity/specificity (though I also like the numbers), but it is not that useful for PPV/NPV
Use heatmap?
autoplot(cm, type = "heatmap")
cm |>
  summary() |>
  filter(.metric == "ppv" | .metric == "npv") |>
  select(-.estimator)
Question: Can you link the numbers in the confusion matrix on the previous slide to the PPV and NPV metrics?
Show Answer
PPV is "accuracy" for the positive predictions (in this instance, when the model predicts yes): PPV = 30 / (30 + 13). NPV is "accuracy" for the negative predictions (when the model predicts no): NPV = 368 / (368 + 17).
PPV and NPV are influenced by both sensitivity and specificity BUT ALSO prevalence.
This becomes important in unbalanced settings where prevalence of classes is not equal
Your classifier’s PPV will be lower, even with good sensitivity and specificity, if the prevalence of the positive class is low
Conversely, your classifier’s NPV will be lower, even with good sensitivity and specificity, if the prevalence of the negative class is low.
Prevalence also can vary by testing setting
Tests for many genetic disorders have very good sensitivity and specificity but their PPV (and NPV) vary widely by setting/patients tested
Test for multiple endocrine neoplasia type 2 (MEN2) based on mutations in RET
Sensitivity = 98%
Specificity = 99.9%
MEN2 has a prevalence of 1/30,000 in the general population. If using the test in the general population with 3 million people:
100 will have the disease
2,999,900 will not have the disease
Column accuracies (sensitivity and specificity) are high
PPV will be very BAD; 98/(3000 + 98) = 3.2%
Though NPV will be very near perfect! 2996900 / (2996900 + 2)
             Ground Truth
Prediction   Positive   Negative
  Positive   98         3000
  Negative   2          2996900
However, MEN2 prevalence is high (1/5) among patients who present in a clinic with medullary thyroid carcinoma. If we only used the test among 3 million of these patients:
600,000 will have the disease
2,400,000 will NOT have the disease (still unbalanced, but much less so)
Column accuracies (sensitivity and specificity) remain the same (98% and 99.9%)
These are properties of the test/classifier
PPV is now much better; 588,000 / (2400 + 588,000) = 99.6%
             Ground Truth
Prediction   Positive   Negative
  Positive   588000     2400
  Negative   12000      2397600
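To make the role of prevalence concrete, here is a small sketch (not from the original notes; ppv_from_rates is an illustrative helper) that computes PPV directly from sensitivity, specificity, and prevalence via Bayes’ rule. The two prevalences used for MEN2 above reproduce the ~3.2% and ~99.6% values:

# PPV from sensitivity, specificity, and prevalence (Bayes' rule)
ppv_from_rates <- function(sens, spec, prev) {
  (sens * prev) / (sens * prev + (1 - spec) * (1 - prev))
}

ppv_from_rates(sens = .98, spec = .999, prev = 1/30000)  # general population: ~0.032
ppv_from_rates(sens = .98, spec = .999, prev = 1/5)      # thyroid carcinoma clinic: ~0.996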
Now think about “accuracy” of any specific COVID test
It dismayed me to see talk of accuracy
The cost of the two types of errors is different!
Occasionally, there was talk of sensitivity and specificity
There was rarely/never discussion of PPV and NPV, which is what matters most when you are given your test result
Question: How would the PPV and NPV change when we moved from testing only people with symptoms who presented at the hospital to testing everyone (e.g., all college students)?
Show Answer
Relatively speaking, when testing someone with obvious COVID symptoms PPV would be high but NPV could be low. Conversely, for our students PPV is likely low but NPV is likely high
In some instances, it may be more useful to focus on precision and recall rather than sensitivity and specificity. The F1 measure is the harmonic mean of precision and recall
This is a row and column accuracy
Recall (sensitivity) focuses on how many true positive cases will we correctly identify
Precision (PPV) focuses on how accurate the prediction of “positive” will be (prevalence dependent)
This keeps the focus on positive cases
autoplot(cm, type = "heatmap")
cm |>
  summary() |>
  filter(.metric == "precision" | .metric == "recall" | .metric == "f_meas") |>
  select(-.estimator)
Question: Can you link the numbers in the confusion matrix on the previous slide to the Recall (Sensitivity) and Precision (PPV) metrics?
Show Answer
Recall/sensitivity is "accuracy" for the positive cases (in this instance, patients with heart disease): Recall = 30 / (30 + 17). Precision/PPV is "accuracy" for the positive predictions (when the model predicts yes): Precision = 30 / (30 + 13).
\(F1\) is the harmonic mean of Recall and Precision
Harmonic means are used with rates (see more detail about the Pythagorean means, if interested)
\(F1 = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall}\), or using the more general formula for harmonic means: \(F1 = \frac{2}{\frac{1}{Precision} + \frac{1}{Recall}}\)
\(F_\beta\) is a weighted version where \(\beta\) is the relative weighting of recall to precision
Two commonly used values for \(\beta\) are 2, which weighs recall twice as much as precision, and 0.5, which weighs precision twice as much as recall
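For reference, the standard general formula is \(F_\beta = (1 + \beta^2) \cdot \frac{Precision \cdot Recall}{\beta^2 \cdot Precision + Recall}\), which reduces to \(F1\) when \(\beta = 1\)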
When outcome is unbalanced, some agreement/accuracy (relationship between model predictions and reference/ground truth) is expected
Kappa adjusts for this
Kappa is essentially the proportional increase in accuracy above the accuracy expected by the base rates of the reference and classifier
To calculate the expected accuracy, we need to consider the probabilities of reference and classifier prediction both being positive (and both being negative) by chance given the base rates of these classes for the reference and classifier.
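As a concrete sketch of that calculation (using the cell counts from our confusion matrix above: TP = 30, FP = 13, FN = 17, TN = 368; the hand calculation should match yardstick’s kap metric):

n <- 30 + 13 + 17 + 368                      # 428 test observations
acc_observed <- (30 + 368) / n               # ~0.93

# expected accuracy by chance, from the marginal base rates
p_truth_yes <- (30 + 17) / n                 # base rate of "yes" in ground truth
p_pred_yes  <- (30 + 13) / n                 # base rate of "yes" in the predictions
acc_expected <- p_truth_yes * p_pred_yes + (1 - p_truth_yes) * (1 - p_pred_yes)  # ~0.81

# kappa: improvement over chance, scaled by the maximum possible improvement
(acc_observed - acc_expected) / (1 - acc_expected)   # ~0.63

cm |> summary() |> filter(.metric == "kap")          # compare with yardstick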
Let’s return now to consider sensitivity and specificity again
Remember that our classifier is estimating the probability of an observation being in the positive class.
We dichotomize this probability when we formally make a class prediction
If the probability > 50%, we classify the observation as positive
If the probability <= 50%, we classify the observation as negative
Question: How can we improve sensitivity?
Show Answer
We can use a more liberal/lower classification threshold for saying someone has heart disease. For example, rather than requiring a 50% probability to classify as yes for heart disease, we could lower to 20% for the classification threshold
Question: What will the consequences of this change be?
Show Answer
First, the Bayes classifier threshold of 50% produces the highest overall accuracy, so accuracy will generally (though not always) drop when you shift from 50%. If we think about this change as it applies to the columns of our confusion matrix, we will now catch more of the yes cases (fewer false negatives/misses), so sensitivity will go up. This was the goal of the lower threshold. However, we will also end up with more false positives, so specificity will drop. If you consider the rows of the confusion matrix, we will have more false positives, so the PPV will drop. However, we will have fewer false negatives, so the NPV will increase. Whether these trade-offs are worth it is a function of the cost of different types of errors and how much you gain and lose with regard to each type of performance metric (ROC can inform this; more in a moment)
autoplot(cm, type = "heatmap")
Previously, we simply used predict(fit_glmnet, feat_test)$.pred_class.
$.pred_class dichotomized at 50% by default
This is the classification threshold that will produce the best overall accuracy from the predicted probabilities (i.e., the Bayes classifier)
However, we can use a different threshold to increase sensitivity or specificity
This comes at a cost to the other characteristic (it’s a trade-off)
Lower threshold increases sensitivity but decreases specificity
Higher threshold increases specificity but decreases sensitivity
It is relatively easy to make a new confusion matrix and get new performance metrics with a different classification threshold
Make a tibble with truth and predicted probabilities
preds <- tibble(truth = feat_test$disease,
                prob = predict(fit_glmnet, feat_test, type = "prob")$.pred_yes)
preds
# A tibble: 428 × 2
truth prob
<fct> <dbl>
1 no 0.0178
2 no 0.119
3 no 0.0204
4 no 0.00269
5 no 0.0105
6 no 0.00157
7 no 0.00574
8 no 0.0414
9 no 0.00897
10 no 0.104
# ℹ 418 more rows
Use this to get class estimates at any threshold we want
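One way to do that (a sketch; the exact code from the notes may differ) is to dichotomize the probabilities at the new cut point, keeping "yes" as the first factor level:

preds <- preds |>
  mutate(estimate_20 = if_else(prob >= .20, "yes", "no"),
         estimate_20 = factor(estimate_20, levels = c("yes", "no")))
preds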
# A tibble: 428 × 3
truth prob estimate_20
<fct> <dbl> <fct>
1 no 0.0178 no
2 no 0.119 no
3 no 0.0204 no
4 no 0.00269 no
5 no 0.0105 no
6 no 0.00157 no
7 no 0.00574 no
8 no 0.0414 no
9 no 0.00897 no
10 no 0.104 no
# ℹ 418 more rows
We can now make a confusion matrix for this new set of truth and estimates using the 20% threshold
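A sketch of that step (assuming the preds tibble with the estimate_20 column from above):

cm_20 <- preds |>
  conf_mat(truth, estimate_20)
cm_20

cm_20 |>
  summary() |>
  select(-.estimator)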
Do the changes on each of these metrics make sense to you? If not, please review these previous slides again!
You can begin to visualize the classifier performance by threshold simply by plotting histograms of the predicted positive class probabilities, separately for the true positive and negative classes
Let’s look at our classifier
Ideally, the probabilities are mostly low for the true negative class (“no”) and mostly high for the true positive class (“yes”)
You can imagine how any specific probability cut point would affect specificity (apply cut to the left panel) or sensitivity (apply cut to the right panel)
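A minimal ggplot2 sketch of such a plot (assuming the preds tibble from above; the panel order follows the factor levels of truth):

preds |>
  ggplot(aes(x = prob)) +
  geom_histogram(bins = 30) +
  facet_wrap(~ truth) +
  labs(x = "Predicted probability of heart disease (yes)")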
Question: What do you think about its performance? What insights does this plot generate?
Show Answer
1. You can see that we can likely drop the threshold to somewhere about 25% without decreasing the specificity too much. This will allow you to detect more positive cases. 2. Our model seems to be able to predict negative cases well. They mostly have low probabilities. However, you can see its poor performance with positive cases. They are spread pretty evenly across the full range of probabilities. We likely do not have enough positive cases in our training data
The Receiver Operating Characteristics (ROC) curve for a classifier provides a more formal method to visualize the trade-offs between sensitivity and specificity across all possible thresholds for classification.
Let’s look at this in our example
We need columns for truth and probabilities of the positive class for each observation
We need to specify the positive class
Returns tibble with data to plot an ROC curve
roc_plot <- tibble(truth = feat_test$disease,
                   prob = predict(fit_glmnet, feat_test, type = "prob")$.pred_yes) |>
  roc_curve(prob, truth = truth)
roc_plot
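The resulting tibble can be plotted directly; one convenient option (a sketch) is the autoplot() method, which also draws the diagonal reference line for a random classifier:

autoplot(roc_plot)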
A random classifier would have a diagonal curve from bottom-left to top-right (the dotted line)
A perfect classifier would reach up to top-left corner
Sensitivity = 1 (true positive rate)
1 - Specificity = 0 (false positive rate)
The ROC Curve is not only a useful method to visualize classifier performance across thresholds
The area under the ROC curve (auROC) is an attractive performance metric
Ranges from 1.0 (perfect) down to approximately 0.5 (random classifier)
If the auROC was consistently less than 0.5, then the predictions could simply be inverted
Values between .70 and .80 are considered fair
Values between .80 and .90 are considered good
Values above .90 are considered excellent
These are very rough, and to my eye, the exact cuts and labels are somewhat arbitrary
auROC is the probability that the classifier will rank/predict a randomly selected true positive observation higher than a randomly selected true negative observation
Alternatively, it can be thought of as the average sensitivity across all decision thresholds
auROC summarizes performance (sensitivity vs. specificity trade-off) across all possible thresholds
auROC is not affected by class imbalances in contrast to many other metrics
It is easy to get the auROC for the ROC in tidymodels using roc_auc()
As with calculating the ROC curve, we need
truth
predicted probabilities for the positive class
to specify the event_level (default is first)
tibble(truth = feat_test$disease,
       prob = predict(fit_glmnet, feat_test, type = "prob")$.pred_yes) |>
  roc_auc(prob, truth = truth)
8.5 Using Alternative Performance Metrics for Model Selection
You can select the best model configuration using resampling with performance metrics other than accuracy
Aggregate measures are typically your best choice (except potentially with high imbalance - more on this later)
For classification:
Accuracy
Balanced accuracy
\(F1\)
Area under ROC Curve (auROC)
Kappa
For regression
RMSE
\(R^2\)
MAE (mean absolute error)
You typically need to use a single metric for selection among model configurations
You should generally use the performance metric that is the best aligned with your problem:
In classification
Do you care about the types of errors or just the overall error rate?
Is the outcome relatively balanced or unbalanced?
What metric will be clearest to your audience?
In regression
Do you want to weight big and small errors the same?
What metric will be clearest to your audience? (Though all of these are pretty clear; there are more complicated regression metrics.)
Although you will use one metric to select the best configuration, you can evaluate/characterize the performance of your final model with as many metrics as you like
You should recognize the differences between the cost function for the algorithm and the performance metric:
Cost function is fundamental to the definition of the algorithm
Cost function is minimized to determine parameter estimates in parametric models
Performance metric is independent of algorithm
Performance metric is used to select and evaluate model configurations
Sometimes they can be the same metric (e.g., RMSE)
BUT, this is not required
With tidymodels, it is easy to select hyperparameters or select among model configurations more generally using one of many different performance metrics
We will still use either tune_grid() or fit_resamples()
We will simply specify a different performance metric inside of metric_set()
If we only measure one performance metric, we can use defaults with show_best()
Here is an example using roc_auc(), but you can use any performance metric function from the yardstick package
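A sketch of what that call might look like (the object names splits, rec, and grid_glmnet are placeholders for your resamples, recipe, and tuning grid, not objects defined in these notes):

fits_glmnet <- logistic_reg(penalty = tune(), mixture = tune()) |>
  set_engine("glmnet") |>
  tune_grid(preprocessor = rec,
            resamples = splits,
            grid = grid_glmnet,
            metrics = metric_set(roc_auc))

show_best(fits_glmnet)   # defaults are fine because we tracked only one metric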
Changes to the classification/decision threshold that trade-off sensitivity vs. specificity for a fitted model (already discussed)
Changes to performance metric for selecting the best model configuration (already demonstrated)
Sampling/Resampling methods that will affect the balance of the outcome in the training data to fit models that are better with the minority class (new)
8.6.1 Classification (Decision) Threshold
We have already seen an example of how the classification threshold (the probability at which we split between predicting a case as positive vs. negative) affects sensitivity vs. specificity
Decreasing the threshold (probability) for classifying a case as positive will:
Increase sensitivity and decrease specificity
This will decrease FN but increase FP
This may be useful if the positive class is the minority class
The ROC curve is a useful display to provide this information about your classifier
Curve can be colored to show the threshold
The separate histograms by positive vs. negative can also be useful as well
If you want to use your data to select the best threshold, you will need yet another set of data to make this selection
Can’t make the selection in training because those probabilities are overfit
Can’t make the selection of threshold in test and then also use the same test data to evaluate that model!
8.6.2 Performance Metric Considerations
When you are choosing a performance metric for selecting your best model configuration, you should choose one that is best aligned with the nature of the performance you seek
If you want just good overall accuracy, accuracy may be a good metric
If the outcome is unbalanced, and you care about the types of errors, you might want
Balanced accuracy (average of sensitivity and specificity)
Only sensitivity or specificity by itself (recommended by Kuhn)
auROC
An F measure (harmonic mean of sensitivity and PPV)
May need to think carefully about what is most important to you
Earlier, we saw that we got better sensitivity when we used balanced accuracy rather than accuracy to tune our hyperparameters
8.6.3 Sampling and Resampling to Address Class Imbalance
We can address issues of class imbalance either a priori or post-hoc with respect to data collection
A priori method would be to over-sample to get more of the minority class into your training set
Use targeted recruiting
Can be very costly or impossible in many instances
If possible, this can be much better than the resampling approach below
Post hoc, we can employ a variety of resampling procedures that are designed to make the training data more balanced
We can up-sample the minority class
We can down-sample the majority class
We can synthesize new minority class observations (e.g., SMOTE)
For both a priori sampling or post-hoc resampling strategies, it is important that your test set is not manipulated. It should represent the expected distribution for the outcome, unaltered
8.6.4 Up-sampling
We resample minority class observations with replacement within our training set to increase the number of total observations of the minority class in the training set.
This simply duplicates existing minority class observations
Our test (or validation) set(s) should NOT be resampled. This is handled well by step_upsample()
Let’s apply this in our example
Up-sampling is part of feature engineering recipe
Need to specify the outcome (disease)
Can set over_ratio to values other than 1 if desired
Makes sense to do this after missing data imputation and dummy coding
Makes sense to do this before normalizing features
These steps are in the themis package rather than recipes (you can use the namespace or load the full library); see the sketch below
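A sketch of such a recipe (the predictor-handling steps and the data_trn name are illustrative placeholders, not the exact recipe from this unit):

library(themis)   # provides step_upsample()

rec_up <- recipe(disease ~ ., data = data_trn) |>
  step_impute_median(all_numeric_predictors()) |>
  step_dummy(all_nominal_predictors()) |>
  step_upsample(disease, over_ratio = 1) |>     # resample "yes" cases up to match "no"
  step_normalize(all_numeric_predictors())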
We resample majority class observations within our training set to decrease/match the number of total observations of the minority class in the training set.
This selects a subset of the majority class
Our test (or validation) set(s) should NOT be resampled. This is handled well by step_downsample()
Down-sampling is part of feature engineering recipe
Need to specify the outcome (disease)
Can set under_ratio to values other than 1 if desired
Makes sense to do this after missing data imputation and dummy coding
Makes sense to do this before normalizing features (see the sketch below)
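A parallel sketch for down-sampling (same placeholder caveats as above):

rec_down <- recipe(disease ~ ., data = data_trn) |>
  step_impute_median(all_numeric_predictors()) |>
  step_dummy(all_nominal_predictors()) |>
  step_downsample(disease, under_ratio = 1) |>  # subsample "no" cases down to match "yes"
  step_normalize(all_numeric_predictors())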
A third approach to resampling is called the synthetic minority over-sampling technique (SMOTE)
To up-sample the minority class, SMOTE synthesizes new observations.
To do this, an observation is randomly selected from the minority class.
This observation’s K-nearest neighbors (KNNs) are then determined.
The new synthetic observation retains the outcome but takes a random combination of the predictor values from the randomly selected observation and its neighbors.
This is easily implemented by recipe in tidymodels using step_smote()
Need to specify the outcome (disease)
Can set over_ratio to values other than 1 (default) if desired
Can set neighbors to values other than 5 (default) if desired
Makes sense to do this after missing data imputation and dummy coding
Other features will need to be scaled/range-corrected prior to use (for distance)
Makes sense to do this before normalizing features for glmnet, etc. (see the sketch below)
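And a sketch for SMOTE (again with placeholder names; step_range() is one way to put the predictors on a comparable scale before the KNN step):

rec_smote <- recipe(disease ~ ., data = data_trn) |>
  step_impute_median(all_numeric_predictors()) |>
  step_dummy(all_nominal_predictors()) |>
  step_range(all_numeric_predictors()) |>            # scale features for the distance calculation
  step_smote(disease, over_ratio = 1, neighbors = 5) |>
  step_normalize(all_numeric_predictors())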
I’m not understanding the difference between tuning the model to create a fit object, and then creating a separate fit object to evaluate it using the test data. I know this is fundamental but it somehow has slipped through my understanding.
What is “tuning”?
How do we tune hyperparameters and then evaluate the best model config? Using validation + test. Using 10-fold + test
Can we go over some more concrete (real-world) examples of when it would be a good idea to use a different threshold for classification?
I would like to understand more about adjusting decision thresholds across a few more application contexts. Also, are there any techniques to optimize decision thresholds? Or is it trial and error.
Is there a limit on how much you should up/down-sample or SMOTE? In other words, is there a threshold of imbalance at which it’s not helpful to use one of those techniques, since you are just putting in “fake” data? (Or is “fake data” a bad/incorrect way to think about it?)
8.7.6 Kappa
Expected accuracy is the accuracy obtained by chance given the base rates (e.g., the majority class proportion) of the reference and the classifier. Kappa is the ratio of the improvement in observed accuracy above this expected accuracy to the maximum possible improvement.
Kuhn, Max, and Kjell Johnson. 2018. Applied Predictive Modeling. Corrected 2nd printing. New York: Springer.
Landis, J. R., and G. G. Koch. 1977. “The Measurement of Observer Agreement for Categorical Data.” Biometrics 33 (1): 159–74.