lab_unit_08

options(conflicts.policy = "depends.ok")
devtools::source_url("https://github.com/jjcurtin/lab_support/blob/main/fun_ml.R?raw=true")
ℹ SHA-1 hash of file is "77e91675366f10788c6bcb59fa1cfc9ee0c75281"
tidymodels_conflictRules()
library(tidyverse) 
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidymodels)
── Attaching packages ────────────────────────────────────── tidymodels 1.2.0 ──
✔ broom        1.0.7     ✔ rsample      1.2.1
✔ dials        1.3.0     ✔ tune         1.2.1
✔ infer        1.0.7     ✔ workflows    1.1.4
✔ modeldata    1.4.0     ✔ workflowsets 1.1.0
✔ parsnip      1.2.1     ✔ yardstick    1.3.2
✔ recipes      1.1.0     
── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ scales::discard() masks purrr::discard()
✖ dplyr::filter()   masks stats::filter()
✖ recipes::fixed()  masks stringr::fixed()
✖ dplyr::lag()      masks stats::lag()
✖ yardstick::spec() masks readr::spec()
✖ recipes::step()   masks stats::step()
• Dig deeper into tidy modeling with R at https://www.tmwr.org
library(ggplot2)
library(xfun, include.only = "cache_rds")

devtools::source_url("https://github.com/jjcurtin/lab_support/blob/main/fun_plots.R?raw=true")
ℹ SHA-1 hash of file is "def6ce26ed7b2493931fde811adff9287ee8d874"
devtools::source_url("https://github.com/jjcurtin/lab_support/blob/main/fun_eda.R?raw=true")
ℹ SHA-1 hash of file is "c045eee2655a18dc85e715b78182f176327358a7"
theme_set(theme_classic())
options(tibble.width = Inf, dplyr.print_max=Inf)

function, vis, eda

CM, autoplot

  • I haven’t started the coding assignment yet (because of the 710 midterm this week), but I’d like to know if autoplot is only used to plot confusion matrices or if we can use it to plot other things? And if so, what?

autoplot() is not limited to confusion matrices; it creates a complete ggplot appropriate to whatever object it is given (e.g., tuning results), as the help entry below shows.
?autoplot
Help on topic 'autoplot' was found in the following packages:

  Package               Library
  tune                  C:/Users/Patron/AppData/Local/R/win-library/4.4
  ggplot2               C:/Users/Patron/AppData/Local/R/win-library/4.4
  workflowsets          C:/Users/Patron/AppData/Local/R/win-library/4.4
  parsnip               C:/Users/Patron/AppData/Local/R/win-library/4.4


Using the first match ...
# in ggplot2
# autoplot: Create a complete ggplot appropriate to a particular data type
  • Maybe provide some functions we can play around with to make the confusion matrix publication quality?
# Sample multi-class data
set.seed(421)
df <- tibble(
  truth = factor(sample(c("A", "B", "C", "D"), 150, replace = TRUE)),  # True Labels
  predicted = factor(sample(c("A", "B", "C", "D"), 150, replace = TRUE))  # Model Predictions
)

# Compute confusion matrix
cm <- conf_mat(df, truth, predicted)

# Default autoplot
autoplot(cm)

We can modify the colors, labels, fonts, etc. (The “Scale for fill is already present” message below is expected: autoplot()’s heatmap already supplies a fill scale, and scale_fill_gradient() simply replaces it.)

autoplot(cm, type = "heatmap") +
  scale_fill_gradient(low = "yellow", high = "blue") +  # Custom color scale
  theme_minimal() +
  labs(title = "Confusion Matrix",
       fill = "Count") +  # Adjust labels
  theme(plot.title = element_text(hjust = 0.5, size = 16, face = "bold"))
Scale for fill is already present.
Adding another scale for fill, which will replace the existing scale.

predictor variance

  • Scale of variance: In the homework there is a question about variance. I had trouble understanding that and would like to hear more about it. It was not easy to tell visually whether variance was an issue for the predictors because some predictors appeared to have a wide spread but the scale was only like 0.1 from min to max. Others had what appeared to be a very compressed spread but the scale was in the hundreds. Do we consider absolute variance or the relative variance here?

grid plot

  • I would also be open to looking over how we can make our glmnet hyperparameter graphs look better. It was kind of confusing to fix.
  • Perhaps a way to visualize the best hyperparameter indices for resampling.

John’s plot_hyperparameters() function, which plots performance on the y-axis, one hyperparameter on the x-axis, and the other hyperparameter as colored curves, gives an overall picture of performance across the tuned hyperparameters.

Another choice might be using a heatmap.

# create a fake df
set.seed(123)  # Ensure reproducibility

# Generate a fake dataset
n <- 100  # Number of samples
p <- 20   # Number of features

# Generate informative features (correlated with outcome)
X <- as.data.frame(matrix(rnorm(n * p), nrow = n, ncol = p))
colnames(X) <- paste0("V", 1:p)  # Assign column names

# Create a linear combination of some features as the signal
signal <- 0.8 * X$V1 + 0.6 * X$V2 - 0.5 * X$V3 + 0.4 * X$V4

# Generate probabilities using logistic function
prob <- 1 / (1 + exp(-signal))  # Sigmoid function

# Assign binary labels based on probability
y <- rbinom(n, 1, prob) |> 
  as.factor()

# Combine into a dataframe
df <- cbind(X, y = y)

# Print first few rows
head(df)
           V1          V2         V3         V4          V5          V6
1 -0.56047565 -0.71040656  2.1988103 -0.7152422 -0.07355602 -0.60189285
2 -0.23017749  0.25688371  1.3124130 -0.7526890 -1.16865142 -0.99369859
3  1.55870831 -0.24669188 -0.2651451 -0.9385387 -0.63474826  1.02678506
4  0.07050839 -0.34754260  0.5431941 -1.0525133 -0.02884155  0.75106130
5  0.12928774 -0.95161857 -0.4143399 -0.4371595  0.67069597 -1.50916654
6  1.71506499 -0.04502772 -0.4762469  0.3311792 -1.65054654 -0.09514745
           V7         V8         V9        V10         V11        V12
1  1.07401226 -0.7282191  0.3562833 -1.0141142 -0.99579872  0.9159921
2 -0.02734697 -1.5404424 -0.6580102 -0.7913139 -1.03995504  0.8006224
3 -0.03333034 -0.6930946  0.8552022  0.2995937 -0.01798024 -0.9365690
4 -1.51606762  0.1188494  1.1529362  1.6390519 -0.13217513 -1.4007874
5  0.79038534 -1.3647095  0.2762746  1.0846170 -2.54934277  0.1602775
6 -0.21073418  0.5899827  0.1441047 -0.6245675  1.04057346 -0.2739624
         V13        V14        V15        V16        V17        V18        V19
1  0.6198501 -0.7497258 -1.0861182 -0.8209867 -0.2890233 -0.1925602 -1.2893642
2 -0.7575102 -0.3216061 -0.6653028 -0.3072572  0.6565134 -0.4697965 -0.6545686
3  0.8515247 -1.1477708  0.7148484 -0.9020980 -0.4539977 -3.0478609 -0.0573241
4 -0.7479300  0.3543522 -0.4316611  0.6270687 -0.5938646  1.8686555  1.2567478
5  0.6302398  0.4247998  0.2276149  1.1203550 -1.7103797  1.7904242  1.5874541
6  1.0966616  0.6483474  1.2949458  2.1272136 -0.2094484 -1.1010817  0.3194815
          V20 y
1  1.53732754 0
2 -0.45577106 1
3 -0.03265845 1
4  1.63675735 1
5 -0.32904197 0
6 -2.60403817 1
set.seed(12345)

splits_test <- df |>  
  initial_split(prop = 3/4, strata = "y")

d_trn <- splits_test |> 
  analysis()

d_test <- splits_test |> 
  assessment()
# set up resampling and hp grid
set.seed(12345)

splits_boot <- d_trn |> 
  bootstraps(times = 10, strata = "y") 

grid_glmnet <- expand_grid(penalty = exp(seq(-6, 3, length.out = 100)),
                           mixture = seq(0, 1, length.out = 6))
rec <- recipe(y ~ ., data = d_trn)

fits_glmnet <- cache_rds(
  expr = {
    logistic_reg(penalty = tune(), 
                 mixture = tune()) |> 
      set_engine("glmnet") |> 
      set_mode("classification") |> 
      tune_grid(preprocessor = rec, 
                resamples = splits_boot, grid = grid_glmnet, 
                metrics = metric_set(accuracy))
  },
  dir = "cache/",
  file = "fits_glmnet_lab_8",
  rerun = FALSE)
plot_hyperparameters(fits_glmnet, hp1 = "penalty", hp2 = "mixture", 
                     metric = "accuracy", log_hp1 = TRUE)

metric_data <- tune::collect_metrics(fits_glmnet)

metric_data |> 
  ggplot(aes(x = log(penalty), y = mixture, fill = mean)) +
  geom_tile() +
  scale_fill_gradient()

facet_wrap

  • Can we go over the type of arguments that go into facet_wrap? In histograms?
# Create a fake dataset
set.seed(123)
df <- data.frame(
  category = rep(c("A", "B", "C"), each = 100),
  value = c(rnorm(100, mean = 50, sd = 10),
            rnorm(100, mean = 70, sd = 15),
            rnorm(100, mean = 30, sd = 5))
)

# Plot histograms with facet_wrap()
ggplot(df, aes(x = value)) +
  geom_histogram(binwidth = 5, fill = "steelblue", color = "white") +
  facet_wrap(~ category) +
  theme_bw()
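
A few commonly used facet_wrap() arguments, sketched on the same fake data (nrow/ncol control the panel layout; scales controls whether axes are shared across panels):

# facets: a formula (~ category) or vars(category) naming the faceting variable(s)
# nrow / ncol: number of rows/columns of panels
# scales: "fixed" (default), "free_x", "free_y", or "free" per-panel axis scales
ggplot(df, aes(x = value)) +
  geom_histogram(binwidth = 5, fill = "steelblue", color = "white") +
  facet_wrap(~ category, nrow = 1, scales = "free_x") +
  theme_bw()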

function construction

  • I’m still confused about building a function in R.
# define a function
demo_fun <- function(x = 10, y = 10){
  # some operations on the inputs
  z1 <- x * y
  z2 <- x + y
  
  # return outputs
  return(c(z1, z2))
}
demo_fun()
[1] 100  20
# 100 20
demo_fun(5, 5)
[1] 25 10
# 25 10
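
If the outputs differ in meaning, returning a named list keeps them labeled; a minimal sketch (demo_fun2 is just an illustration, not a lab function):

demo_fun2 <- function(x = 10, y = 10){
  # return a named list instead of an unnamed vector
  list(product = x * y,
       sum = x + y)
}
demo_fun2(5, 5)$product
# 25
demo_fun2(5, 5)$sum
# 10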

function conflict

  • Why do I still need to specify function conflicts even though I have only loaded tidyverse and tidymodels? I also included John’s functions. The one function with a conflict is select().

One fix is to load the conflict policies before attaching any packages, so masked functions like select() resolve the way you expect; a minimal sketch of the ordering is below.
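
This mirrors the setup chunk at the top of this document:

# 1. set the conflict policy before attaching any packages
options(conflicts.policy = "depends.ok")
# 2. source John's lab support functions and set the tidymodels conflict rules
devtools::source_url("https://github.com/jjcurtin/lab_support/blob/main/fun_ml.R?raw=true")
tidymodels_conflictRules()
# 3. only then attach the packages
library(tidyverse)
library(tidymodels)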

regularization, tuning

grid

  • I would like to know more about how to specify the numeric values for the penalty values that are inside the expand_grid function. What does it mean to have seq(-8, 3, length.out = 200)?

seq(-8, 3, length.out = 200) generates 200 evenly spaced values from -8 to 3 (inclusive). In the glmnet grids used in this unit these values are wrapped in exp() (as in grid_glmnet above), so the resulting penalties are evenly spaced on the log scale, which is useful because sensible penalty values span several orders of magnitude (see the short sketch after the help excerpt below).

?seq
Help on topic 'seq' was found in the following packages:

  Package               Library
  timeDate              C:/Users/Patron/AppData/Local/R/win-library/4.4
  base                  C:/PROGRA~1/R/R-44~1.2/library


Using the first match ...
# ## Default S3 method:
# seq(from = 1, to = 1, by = ((to - from)/(length.out - 1)), length.out = NULL, along.with = NULL, ...)

# length.out    
# desired length of the sequence. A non-negative number, which for seq and seq.int will be rounded up if fractional.
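
A minimal sketch of what exp(seq(-8, 3, length.out = 200)) produces (values in comments are approximate):

penalties <- exp(seq(-8, 3, length.out = 200))
length(penalties)  # 200 candidate penalty values
range(penalties)   # ~0.000335 (= exp(-8)) up to ~20.1 (= exp(3))
# evenly spaced on the log scale, so small penalties are sampled as densely as large ones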

balanced sampling

over/under_ratio

  • In downsampling what does under_ratio = 1 mean? What would different values mean. Is it just the number of samples?

It is the ratio of majority-class to minority-class frequencies after downsampling (#majority / #minority). With under_ratio = 1 the majority class is sampled down to match the minority class; larger values keep proportionally more majority rows (see the help excerpt and the count sketch below).

?themis::step_downsample

# under_ratio   
# A numeric value for the ratio of the majority-to-minority frequencies. 
# The default value (1) means that all other levels are sampled down to have the same frequency as the least occurring level. 
# A value of 2 would mean that the majority levels will have (at most) (approximately) twice as many rows than the minority level.
  • How does changing over_ratio to values other than 1 make a difference?
?themis::step_upsample

# over_ratio    
# A numeric value for the ratio of the minority-to-majority frequencies. 
# The default value (1) means that all other levels are sampled up to have the same frequency as the most occurring level. 
# A value of 0.5 would mean that the minority levels will have (at most) (approximately) half as many rows than the majority level.
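
A minimal sketch of how the ratios change the class counts, using the d_trn training data created above (its classes happen to be roughly balanced, but the mechanics are the same):

library(themis)

d_trn |> count(y)  # original class counts

# under_ratio = 1: majority class is downsampled to the minority-class frequency
recipe(y ~ ., data = d_trn) |>
  themis::step_downsample(y, under_ratio = 1) |>
  prep() |>
  bake(new_data = NULL) |>
  count(y)

# over_ratio = 1: minority class is upsampled to the majority-class frequency
recipe(y ~ ., data = d_trn) |>
  themis::step_upsample(y, over_ratio = 1) |>
  prep() |>
  bake(new_data = NULL) |>
  count(y)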

examples to implement up/down-sampling

  • Upsampling and downsampling; more examples of how this would be implemented.
  • How to use the different resampling techniques to address uneven class distribution? examples for the same.
  • Can you talk more on the step_upsample()?

We use them in the recipe by calling themis::step_upsample() or themis::step_downsample(); a sketch of where these steps go in a recipe follows.
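
A minimal sketch, assuming the same y and d_trn as above (the fake data have no nominal predictors, so step_dummy() is a no-op here and is shown only to illustrate the ordering):

library(themis)

rec_up <- recipe(y ~ ., data = d_trn) |>
  step_dummy(all_nominal_predictors()) |>          # encode categorical predictors first
  themis::step_upsample(y, over_ratio = 1) |>      # duplicate minority-class rows (training only)
  step_normalize(all_numeric_predictors())

rec_down <- recipe(y ~ ., data = d_trn) |>
  step_dummy(all_nominal_predictors()) |>
  themis::step_downsample(y, under_ratio = 1) |>   # drop majority-class rows (training only)
  step_normalize(all_numeric_predictors())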

details of up/down-sampling and SMOTE

  • The underlying procedure of step_downsample and step_upsample. Why do we use these functions after step_dummy() and before step_normalize() in the recipe?

Downsampling randomly removes rows from the majority class, and upsampling randomly duplicates rows from the minority class. The ordering in the recipe is intended to preserve the structure of the original data, e.g., the overall distribution of the feature values.

  • I think I would like to look more at the smote technique for resampling. I find this topic a little bit confusing.

SMOTE is a special kind of upsampling for the minority class: rather than duplicating existing rows, it synthesizes new instances whose feature values are interpolated between existing minority-class instances and their nearest neighbors.
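
A minimal sketch with themis::step_smote(), again assuming y and d_trn from above; step_smote() requires all-numeric predictors, which is one reason dummy coding comes earlier in the recipe:

library(themis)

rec_smote <- recipe(y ~ ., data = d_trn) |>
  step_dummy(all_nominal_predictors()) |>
  themis::step_smote(y, over_ratio = 1, neighbors = 5) |>  # synthesize new minority rows by
  step_normalize(all_numeric_predictors())                 # interpolating between neighbors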

  • Implementing SMOTE for class imbalance correctly while avoiding data leakage.

Only prep the recipe with training data. The themis steps also default to skip = TRUE, so the resampling is applied when the recipe is trained but not when it is baked on new (test) data.
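
A minimal sketch of the leakage-safe pattern, reusing rec_smote from the sketch above and the d_trn/d_test split created earlier:

rec_prepped <- rec_smote |> 
  prep(training = d_trn)                             # statistics and SMOTE use training data only

trn_baked  <- rec_prepped |> bake(new_data = NULL)   # balanced training features
test_baked <- rec_prepped |> bake(new_data = d_test) # test features are left untouched (skip = TRUE)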

  • How do we correctly implement up-sampling, down-sampling, and SMOTE in tidymodels, ensuring that resampling affects only the training set and not the test set?

Check out the key.

  • What are the key considerations when tuning models on resampled data?

metrics

ROC

  • What do the arguments for building the ROC plot mean?
roc_plot <- 
  tibble(truth = feat_test$diagnosis,
         prob = predict(fit_best_knn, feat_test, type = "prob")$.pred_malignant) |> 
  roc_curve(prob, truth = truth)

roc_plot |>
  ggplot(aes(x = 1 - specificity, y = sensitivity, color = .threshold)) +
  geom_path() +
  geom_abline(lty = 3) +
  coord_equal() +
  labs(x = "1 - Specificity (FPR)",
       y = "Sensitivity (TPR)")

In the tibble, truth is the ground-truth label and prob is the predicted probability of the positive class (here, .pred_malignant). We pipe it into roc_curve() to compute sensitivity and specificity at every threshold, then plot 1 - specificity against sensitivity.

  • Using and interpreting the results for ROC and auROC.

The ROC curve provides a more formal way to visualize the trade-off between sensitivity and specificity across all possible classification thresholds.

auROC is the probability that the classifier will rank/predict a randomly selected true positive observation higher than a randomly selected true negative observation. Alternatively, it can be thought of as the average sensitivity across all decision thresholds. It summarizes performance (sensitivity vs. specificity trade-off) across all possible thresholds and is not affected by class imbalances in contrast to many other metrics.
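
A minimal sketch of computing auROC with yardstick::roc_auc(), using the same truth/probability tibble as the ROC curve code above (feat_test and fit_best_knn come from the assignment):

tibble(truth = feat_test$diagnosis,
       prob = predict(fit_best_knn, feat_test, type = "prob")$.pred_malignant) |> 
  roc_auc(prob, truth = truth)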

multiple metrics usage

  • metrics = metric_set(roc_auc, accuracy, sens, spec, bal_accuracy): I still don’t really understand why we run all of these metrics at once, is it just to see which one is the best? And if you can do that why would we not always include all of the metrics to see which one is most accurate apart from specific scenarios?

If we’re selecting a best model configuration, we should make sure the selection criterion is the same across all candidate configurations, and for different aims we might use a different metric as that criterion (e.g., balanced accuracy for a more comprehensive selection with imbalanced classes). Recording all of these metrics at once lets us assess the model’s overall performance from several angles, not decide which metric is “most accurate.”
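
A minimal sketch, assuming a hypothetical fits object from a tune_grid() call that used metrics = metric_set(roc_auc, accuracy, sens, spec, bal_accuracy):

collect_metrics(fits)                  # every metric for every candidate configuration
show_best(fits, metric = "roc_auc")    # top configurations ranked by one selection criterion
select_best(fits, metric = "roc_auc")  # the single configuration we would fit as the final model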

parallel processing, cache

specify parallel processing and cache

  • I am still unsure about parallel processing, do we simply include those 2 lines from the slides or do we need to do a similar process as caching where we wrap the expression in code?
  • I am a little confused how parallel processing works, and if we have to integrate it into the other code chunks.

Parallel processing is a global setting for the script, so we only need to specify it once near the beginning, before the heavy computation; we do not wrap individual expressions the way we do with caching. Normally, computations run sequentially (one after another); with parallel processing, multiple tasks run simultaneously across multiple cores.
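
A minimal sketch of one common setup, assuming the doParallel backend (the exact two lines from the slides may differ slightly):

library(doParallel)

# register a parallel backend once, near the top of the script;
# tune_grid()/fit_resamples() will then use the workers automatically
cl <- parallel::makePSOCKcluster(parallel::detectCores(logical = FALSE))
doParallel::registerDoParallel(cl)

# ... tuning / resampling code runs as usual ...

# parallel::stopCluster(cl)  # optionally release the workers at the end of the script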

Caching is case-specific: it is set up per chunk/computational step that takes a long time and may be re-run in the future, so it wraps a specific expression (as cache_rds() wraps tune_grid() earlier in this document).
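
A minimal sketch of the caching pattern (the same cache_rds() wrapper used around tune_grid() above; the slow expression here is just a placeholder):

demo_cached <- cache_rds(
  expr = {
    Sys.sleep(5)       # stand-in for a long computation (e.g., tuning a model)
    rnorm(5)
  },
  dir = "cache/",
  file = "demo_cache_lab_8",
  rerun = FALSE)       # set rerun = TRUE to force the computation to run again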

Questions on application midterm and keras installation