options(conflicts.policy = "depends.ok")
devtools::source_url("https://github.com/jjcurtin/lab_support/blob/main/fun_ml.R?raw=true")
ℹ SHA-1 hash of file is "77e91675366f10788c6bcb59fa1cfc9ee0c75281"
tidymodels_conflictRules()
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidymodels)
── Attaching packages ────────────────────────────────────── tidymodels 1.2.0 ──
✔ broom 1.0.7 ✔ rsample 1.2.1
✔ dials 1.3.0 ✔ tune 1.2.1
✔ infer 1.0.7 ✔ workflows 1.1.4
✔ modeldata 1.4.0 ✔ workflowsets 1.1.0
✔ parsnip 1.2.1 ✔ yardstick 1.3.2
✔ recipes 1.1.0
── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ scales::discard() masks purrr::discard()
✖ dplyr::filter() masks stats::filter()
✖ recipes::fixed() masks stringr::fixed()
✖ dplyr::lag() masks stats::lag()
✖ yardstick::spec() masks readr::spec()
✖ recipes::step() masks stats::step()
• Dig deeper into tidy modeling with R at https://www.tmwr.org
library(ggplot2)
library(xfun, include.only = "cache_rds")
devtools::source_url("https://github.com/jjcurtin/lab_support/blob/main/fun_plots.R?raw=true")
ℹ SHA-1 hash of file is "def6ce26ed7b2493931fde811adff9287ee8d874"
devtools::source_url("https://github.com/jjcurtin/lab_support/blob/main/fun_eda.R?raw=true")
ℹ SHA-1 hash of file is "c045eee2655a18dc85e715b78182f176327358a7"
theme_set(theme_classic())
options(tibble.width = Inf, dplyr.print_max=Inf)
autoplot
autoplot() is for simple plotting of an appropriate data type.
?autoplot
Help on topic 'autoplot' was found in the following packages:
Package Library
tune C:/Users/Patron/AppData/Local/R/win-library/4.4
ggplot2 C:/Users/Patron/AppData/Local/R/win-library/4.4
workflowsets C:/Users/Patron/AppData/Local/R/win-library/4.4
parsnip C:/Users/Patron/AppData/Local/R/win-library/4.4
Using the first match ...
# in ggplot2
# autoplot: Create a complete ggplot appropriate to a particular data type
# Sample multi-class data
set.seed(421)
df <- tibble(
  truth = factor(sample(c("A", "B", "C", "D"), 150, replace = TRUE)),     # True labels
  predicted = factor(sample(c("A", "B", "C", "D"), 150, replace = TRUE))  # Model predictions
)

# Compute confusion matrix
cm <- conf_mat(df, truth, predicted)
# Default autoplot
autoplot(cm)
We can modify the colors, labels, fonts, etc.
autoplot(cm, type = "heatmap") +
scale_fill_gradient(low = "yellow", high = "blue") + # Custom color scale
theme_minimal() +
labs(title = "Confusion Matrix",
fill = "Count") + # Adjust labels
theme(plot.title = element_text(hjust = 0.5, size = 16, face = "bold"))
Scale for fill is already present.
Adding another scale for fill, which will replace the existing scale.
John’s function, which plots performance on the y-axis, one hyperparameter on the x-axis, and another hyperparameter as colored curves, gives an overall picture of how performance changes as the hyperparameters are tuned.
Another option is a heatmap.
# create a fake df
set.seed(123) # Ensure reproducibility
# Generate a fake dataset
n <- 100  # Number of samples
p <- 20   # Number of features

# Generate informative features (correlated with outcome)
X <- as.data.frame(matrix(rnorm(n * p), nrow = n, ncol = p))
colnames(X) <- paste0("V", 1:p)  # Assign column names

# Create a linear combination of some features as the signal
signal <- 0.8 * X$V1 + 0.6 * X$V2 - 0.5 * X$V3 + 0.4 * X$V4

# Generate probabilities using logistic function
prob <- 1 / (1 + exp(-signal))  # Sigmoid function

# Assign binary labels based on probability
y <- rbinom(n, 1, prob) |>
  as.factor()

# Combine into a dataframe
df <- cbind(X, y = y)
# Print first few rows
head(df)
V1 V2 V3 V4 V5 V6
1 -0.56047565 -0.71040656 2.1988103 -0.7152422 -0.07355602 -0.60189285
2 -0.23017749 0.25688371 1.3124130 -0.7526890 -1.16865142 -0.99369859
3 1.55870831 -0.24669188 -0.2651451 -0.9385387 -0.63474826 1.02678506
4 0.07050839 -0.34754260 0.5431941 -1.0525133 -0.02884155 0.75106130
5 0.12928774 -0.95161857 -0.4143399 -0.4371595 0.67069597 -1.50916654
6 1.71506499 -0.04502772 -0.4762469 0.3311792 -1.65054654 -0.09514745
V7 V8 V9 V10 V11 V12
1 1.07401226 -0.7282191 0.3562833 -1.0141142 -0.99579872 0.9159921
2 -0.02734697 -1.5404424 -0.6580102 -0.7913139 -1.03995504 0.8006224
3 -0.03333034 -0.6930946 0.8552022 0.2995937 -0.01798024 -0.9365690
4 -1.51606762 0.1188494 1.1529362 1.6390519 -0.13217513 -1.4007874
5 0.79038534 -1.3647095 0.2762746 1.0846170 -2.54934277 0.1602775
6 -0.21073418 0.5899827 0.1441047 -0.6245675 1.04057346 -0.2739624
V13 V14 V15 V16 V17 V18 V19
1 0.6198501 -0.7497258 -1.0861182 -0.8209867 -0.2890233 -0.1925602 -1.2893642
2 -0.7575102 -0.3216061 -0.6653028 -0.3072572 0.6565134 -0.4697965 -0.6545686
3 0.8515247 -1.1477708 0.7148484 -0.9020980 -0.4539977 -3.0478609 -0.0573241
4 -0.7479300 0.3543522 -0.4316611 0.6270687 -0.5938646 1.8686555 1.2567478
5 0.6302398 0.4247998 0.2276149 1.1203550 -1.7103797 1.7904242 1.5874541
6 1.0966616 0.6483474 1.2949458 2.1272136 -0.2094484 -1.1010817 0.3194815
V20 y
1 1.53732754 0
2 -0.45577106 1
3 -0.03265845 1
4 1.63675735 1
5 -0.32904197 0
6 -2.60403817 1
set.seed(12345)
splits_test <- df |>
  initial_split(prop = 3/4, strata = "y")

d_trn <- splits_test |>
  analysis()

d_test <- splits_test |>
  assessment()

# set up resampling and hp grid
set.seed(12345)
splits_boot <- d_trn |>
  bootstraps(times = 10, strata = "y")

grid_glmnet <- expand_grid(penalty = exp(seq(-6, 3, length.out = 100)),
                           mixture = seq(0, 1, length.out = 6))

rec <- recipe(y ~ ., data = d_trn)
fits_glmnet <- cache_rds(
  expr = {
    logistic_reg(penalty = tune(),
                 mixture = tune()) |>
      set_engine("glmnet") |>
      set_mode("classification") |>
      tune_grid(preprocessor = rec,
                resamples = splits_boot, grid = grid_glmnet,
                metrics = metric_set(accuracy))
  },
  dir = "cache/",
  file = "fits_glmnet_lab_8",
  rerun = FALSE)
plot_hyperparameters(fits_glmnet, hp1 = "penalty", hp2 = "mixture",
metric = "accuracy", log_hp1 = TRUE)
metric_data <- tune::collect_metrics(fits_glmnet)

metric_data |>
  ggplot(aes(x = log(penalty), y = mixture, fill = mean)) +
  geom_tile() +
  scale_fill_gradient()
# Create a fake dataset
set.seed(123)
df <- data.frame(
  category = rep(c("A", "B", "C"), each = 100),
  value = c(rnorm(100, mean = 50, sd = 10),
            rnorm(100, mean = 70, sd = 15),
            rnorm(100, mean = 30, sd = 5))
)
# Plot histograms with facet_wrap()
ggplot(df, aes(x = value)) +
geom_histogram(binwidth = 5, fill = "steelblue", color = "white") +
facet_wrap(~ category) +
theme_bw()
# define a function
demo_fun <- function(x = 10, y = 10){
  # some operations on the inputs
  z1 <- x * y
  z2 <- x + y

  # return outputs
  return(c(z1, z2))
}
demo_fun()
[1] 100 20
# 100 20
demo_fun(5, 5)
[1] 25 10
# 25 10
Maybe load the conflict rule policies before loading any packages.
expand_grid() function: what does it mean to have seq(-8, 3, length.out = 200)?
This generates a sequence from -8 to 3 containing 200 evenly spaced values (endpoints included).
?seq
Help on topic 'seq' was found in the following packages:
Package Library
timeDate C:/Users/Patron/AppData/Local/R/win-library/4.4
base C:/PROGRA~1/R/R-44~1.2/library
Using the first match ...
# ## Default S3 method:
# seq(from = 1, to = 1, by = ((to - from)/(length.out - 1)), length.out = NULL, along.with = NULL, ...)
# length.out
# desired length of the sequence. A non-negative number, which for seq and seq.int will be rounded up if fractional.
It means the ratio of majority-class to minority-class frequencies (#majority / #minority).
?themis::step_downsample
# under_ratio
# A numeric value for the ratio of the majority-to-minority frequencies.
# The default value (1) means that all other levels are sampled down to have the same frequency as the least occurring level.
# A value of 2 would mean that the majority levels will have (at most) (approximately) twice as many rows than the minority level.
?themis::step_upsample
# over_ratio
# A numeric value for the ratio of the minority-to-majority frequencies.
# The default value (1) means that all other levels are sampled up to have the same frequency as the most occurring level.
# A value of 0.5 would mean that the minority levels will have (at most) (approximately) half as many rows than the majority level.
We use them in the recipe by calling themis::step_upsample() or themis::step_downsample().
We randomly remove existing instances of the majority class (downsampling) or randomly resample existing instances of the minority class (upsampling). The ordering of these steps is intended to maintain the pattern of the original data, such as the overall distribution of the feature values.
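As a hedged sketch (not taken from the lab key), these steps could be added to a recipe for the fake data created above; the recipe formula and d_trn are reused from earlier, and the ratio values are only illustrative defaults.
# Downsample the majority class to (at most) the same frequency as the minority class
rec_down <- recipe(y ~ ., data = d_trn) |>
  themis::step_downsample(y, under_ratio = 1)

# Upsample the minority class to (at most) the same frequency as the majority class
rec_up <- recipe(y ~ ., data = d_trn) |>
  themis::step_upsample(y, over_ratio = 1)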
It’s a special kind of upsampling for the minority class, where the upsampled instances are synthesized from feature values bounded by the existing instances.
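If this is referring to SMOTE, a minimal sketch with themis::step_smote() (again assuming the fake data from above; the neighbors value is just illustrative) would look like:
# Synthesize new minority-class rows by interpolating between existing neighbors
rec_smote <- recipe(y ~ ., data = d_trn) |>
  themis::step_smote(y, over_ratio = 1, neighbors = 5)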
Only prepare the recipe with training data.
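A minimal sketch of that workflow, reusing the rec, d_trn, and d_test objects defined above (the feature-object names here are illustrative, not the lab’s):
# prep() estimates the recipe using only the training data ...
rec_prepped <- rec |>
  prep(training = d_trn)

# ... and bake() then applies the prepped recipe to both sets
feat_trn <- rec_prepped |> bake(new_data = NULL)    # training features
feat_tst <- rec_prepped |> bake(new_data = d_test)  # test features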
Check out the key.
roc_plot <-
  tibble(truth = feat_test$diagnosis,
         prob = predict(fit_best_knn, feat_test, type = "prob")$.pred_malignant) |>
  roc_curve(prob, truth = truth)

roc_plot |>
  ggplot(aes(x = 1 - specificity, y = sensitivity, color = .threshold)) +
  geom_path() +
  geom_abline(lty = 3) +
  coord_equal() +
  labs(x = "1 - Specificity (FPR)",
       y = "Sensitivity (TPR)")
In the dataframe, truth refers to the ground truth and prob refers to the predicted probability of the positive class. We pipe it into roc_curve() to compute specificity and sensitivity at each threshold.
The ROC provides a more formal method to visualize the trade-offs between sensitivity and specificity across all possible thresholds for classification.
auROC is the probability that the classifier will rank/predict a randomly selected true positive observation higher than a randomly selected true negative observation. Alternatively, it can be thought of as the average sensitivity across all decision thresholds. It summarizes performance (sensitivity vs. specificity trade-off) across all possible thresholds and is not affected by class imbalances in contrast to many other metrics.
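As a hedged companion to the plot above (reusing the same assumed fit_best_knn and feat_test objects from the lab), auROC can be computed from the same truth/probability tibble with yardstick::roc_auc():
# Area under the ROC curve for the held-out test set
tibble(truth = feat_test$diagnosis,
       prob = predict(fit_best_knn, feat_test, type = "prob")$.pred_malignant) |>
  roc_auc(prob, truth = truth)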
metrics = metric_set(roc_auc, accuracy, sens, spec, bal_accuracy): I still don’t really understand why we run all of these metrics at once. Is it just to see which one is the best? And if you can do that, why would we not always include all of the metrics to see which one is most accurate, apart from specific scenarios?
If we’re selecting a best model configuration, we should make sure the selection criterion is the same across all candidate configurations, and for different aims we might use a different metric as that criterion (for a more comprehensive selection, there are also metrics like balanced accuracy). Computing the full set of metrics simply lets us assess the model’s overall performance from several angles.
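A minimal sketch of that idea (the fits object and the chosen selection metric are assumptions, not the lab’s exact code): every metric in the set is computed for every candidate configuration, but selection still names a single criterion.
metrics_multi <- metric_set(roc_auc, accuracy, sens, spec, bal_accuracy)

# All of these metrics get computed for every candidate configuration, e.g.
#   tune_grid(..., metrics = metrics_multi)
# but the best configuration is still picked with one explicit criterion:
#   fits |> show_best(metric = "roc_auc")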
Specify parallel processing and cache: I am still unsure about parallel processing. Do we simply include those two lines from the slides, or do we need to do something similar to caching, where we wrap the expression in code? I am also a little confused about how parallel processing works and whether we have to integrate it into the other code chunks.
Parallel processing is a global setting for the script, so we only need to specify it once at the beginning of the computation. Normally, computations run sequentially (one after another); with parallel processing, multiple tasks run simultaneously, leveraging multiple cores.
Caching is case-specific: it applies to chunks/computational steps that take a long time and may be repeated in the future, so it needs to be set up for each such step.
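One common way to set this up globally, as a hedged sketch (the exact lines in the slides may differ):
# Register a parallel backend once, near the top of the script; subsequent
# resampling/tuning calls (e.g., tune_grid()) can then use the workers.
cl <- parallel::makePSOCKcluster(parallel::detectCores(logical = FALSE))
doParallel::registerDoParallel(cl)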