When using Keras to run neural networks, I get this warning: some required packages prohibit parallel processing. I would like to know how to fix this issue.
This might be caused by a conflict with the innate parallelism of Keras:
“keras/tensorflow implements it’s own parallelism and concurrency, and, in general, doesn’t play nice when other tools try to parallelize it by forking its process and/or assuming concurrently running keras training sessions will equitably share resources (e.g., GPU RAM) with each other on the same machine.” (From: https://github.com/rstudio/keras3/issues/1223)
I would like to know why the model failed for neural network classification when I had already set the outcome variable as a factor, yet the error said it was a character. How can I fix the problem?
We may need more information to diagnose this.
Are relu() and mlp() included in the keras package?
relu is a built-in keras activation (exposed as activation_relu() in the R package; in the demo below we define our own relu() just for plotting). mlp(), however, comes from parsnip in tidymodels, which can use keras as its engine.
Usage
Also, what does it mean to use keras directly from R or Python?
How do TensorFlow and Keras work? Is it just like Python but working in R?
Could we have used keras in Python instead of doing the whole installation through R?
I have never used python/virtual environments/etc. before so I really don’t understand what keras/tensorflow are for or what they are doing in the background but would love more info on how this is working!
Keras is a high-level neural network API (Application Programming Interface), and TensorFlow is the backend computation engine: a library for machine learning computation (especially useful for neural networks in our case).
In R, we access Keras and TensorFlow using the keras and tensorflow packages, which wrap the actual Python libraries.
Under the hood, R still uses Python—that’s why setting up a virtual environment correctly matters.
If you’re seeing errors related to versions, it may be due to misaligned Python/TensorFlow/Keras versions.
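To see what R is actually using under the hood, you can inspect the Python setup from R. A minimal sketch (these helpers come from the reticulate, tensorflow, and keras packages):

library(reticulate)
py_config()           # which Python / virtual environment R is bound to
library(tensorflow)
tf_config()           # TensorFlow version and where it is installed
library(keras)
is_keras_available()  # TRUE if the Python Keras backend is usable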
Resampling
How do you decide the tradeoff between different forms of resampling?
When should we use validation splits vs. things like bootstrap/k-fold in this context? I know that it has to do with the sample size and that bootstrapping/k-fold is a way to handle lower N, but at what point do we make that decision practically?
Yes, the sample size is very important. There is always the bias-variance trade-off, and no single static answer to the decision. Every user should also consider the computational/time resources at hand. Bootstrapping can be computationally expensive, but it can also provide more validation data if you create more resamples, which helps us appreciate the distribution of model performance.
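For reference, the three options in rsample syntax (a minimal sketch on a built-in dataset; note that validation_split() is deprecated in recent rsample versions in favor of the initial_validation_split() family):

library(rsample)
splits_val  <- validation_split(mtcars, prop = 3/4)  # single split: cheap, but a noisier estimate
splits_cv   <- vfold_cv(mtcars, v = 10)              # 10 fits: a common default at moderate N
splits_boot <- bootstraps(mtcars, times = 100)       # many fits: costly, but many performance estimates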
Tuning
How do you decide how to scale features?
For NN models, gradient descent works better with inputs on the same scale.
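In recipes, that usually means a normalization step. A minimal sketch on a built-in dataset:

library(recipes)
rec <- recipe(mpg ~ ., data = mtcars) |>
  step_normalize(all_numeric_predictors())  # center and scale each numeric feature
prep(rec) |> bake(new_data = NULL) |> head()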
If I need to tune multiple hyperparameters, tuning all of them at the same time can be extremely computationally costly. Can we break this into steps? Or can we just do parallel processing and cache?
Sure: using parallel processing and caching will improve efficiency, for the initial training and for later reloading respectively (see the sketch below). Locally running the integrated grid does take a long time. It is also possible to break the work into steps, e.g., by pulling one dimension out of the grid and running separate experiments (using the same grid without that dimension, and setting that dimension's value manually in each training call).
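A minimal sketch of parallel tuning plus caching (the dataset, engine, grid values, and the file name fits_nn.rds are illustrative, not from the course materials):

library(tidymodels)
library(doParallel)

data(two_class_dat, package = "modeldata")

rec <- recipe(Class ~ ., data = two_class_dat) |>
  step_normalize(all_numeric_predictors())
spec <- mlp(hidden_units = tune(), penalty = tune(), epochs = 20) |>
  set_engine("nnet") |>  # swap in "keras" if you have it set up
  set_mode("classification")

cl <- makePSOCKcluster(parallel::detectCores(logical = FALSE))
registerDoParallel(cl)   # tune_grid() can pick up the registered backend
fits <- tune_grid(workflow(rec, spec),
                  resamples = vfold_cv(two_class_dat, v = 5),
                  grid = expand_grid(hidden_units = c(5, 10, 20),
                                     penalty = 10^seq(-4, 0, length.out = 4)))
stopCluster(cl)

saveRDS(fits, "fits_nn.rds")  # cache; later sessions readRDS() instead of refitting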
I’m still unsure how to develop the best neural networks, given the “black box”-ness of the hidden layers.
How do we decide on values to include in the tuning grid?
Is there a suggested initial setup for tuning a neural network model?
For a specific NN architecture, there are often recommended configurations. Sometimes we also train on the basis of a pretrained model, which is fine-tuning; for that kind of training the ranges can be specified more narrowly.
What's the best way to increase the accuracy of our models?
This is very general. Increasing the complexity of the model, e.g., adding more neurons or layers, usually grants the model more capacity, but it also exacerbates the risk of overfitting. Keep the bias-variance trade-off in mind whenever you tune the grid, and perhaps prefer the simpler model when performances are similar in the validation comparison.
Empirically decide, but do not overcomplicate.
Start with simple models and then experiment.
Is there a recommended order or combination for using dropout and L2 regularization together in the same model?
They are not usable together in a tidymodels mlp() workflow (the keras engine takes only one of penalty or dropout), but you could try using them at the same time with keras directly, as sketched below. And again, explore and experiment with your specific needs in mind.
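A minimal sketch of combining the two in keras directly (the layer sizes, rates, and input_shape are illustrative):

library(keras)
model <- keras_model_sequential() |>
  layer_dense(units = 32, activation = "relu", input_shape = 10,
              kernel_regularizer = regularizer_l2(0.01)) |>  # L2 penalty on this layer's weights
  layer_dropout(rate = 0.2) |>                               # dropout after the hidden layer
  layer_dense(units = 1, activation = "sigmoid")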
I’m interested in how dropout and no. of epochs affects overfitting
Dropout randomly deactivates neurons during training, preventing them from contributing to forward or backward passes. This forces the network to develop redundant, independent representations, reducing reliance on specific neurons. As a result, the model becomes more robust and generalizes better, helping to prevent overfitting.
Number of epochs is how many times we expose the model to the training data so it can learn their properties. Too few epochs lead to inadequate learning and underfitting; in contrast, too many epochs drive the model to carve itself to the data in very fine detail, which is overfitting.
How do we choose the appropriate number of epochs, and how can early stopping be effectively implemented in practice?
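One practical answer is keras's early-stopping callback: set the epoch count as an upper bound, monitor validation loss, and stop once it stops improving. A minimal self-contained sketch (the toy data, patience, and layer sizes are illustrative):

library(keras)
x_trn <- matrix(rnorm(2000), ncol = 10)  # toy data: 200 samples, 10 features
y_trn <- rbinom(200, 1, 0.5)
model <- keras_model_sequential() |>
  layer_dense(units = 16, activation = "relu", input_shape = 10) |>
  layer_dense(units = 1, activation = "sigmoid")
model |> compile(optimizer = "adam", loss = "binary_crossentropy")
history <- model |> fit(
  x_trn, y_trn,
  epochs = 100,            # upper bound; early stopping usually ends sooner
  validation_split = 0.2,  # hold out 20% of the training data for validation
  callbacks = list(callback_early_stopping(monitor = "val_loss",
                                           patience = 10,
                                           restore_best_weights = TRUE))
)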
How do we choose hyperparameter values for hidden units and the penalty? Is there a scaling that matters, like for penalty?
It could be helpful to go over how we actually go about tuning the hyperparameters using tune_grid() in tidymodels. What’s the best way to set up the grid?
This may call for some review of the previous units. The basic format of a tuning grid is a data frame with one column per hyperparameter, crossed over the candidate values of each dimension. For example:
library(tidyr)
grid <- expand_grid(hp1 = exp(seq(-8, 3, length.out = 200)),  # some range, here on an exponential scale
                    hp2 = seq(0, 1, length.out = 6),
                    hp3 = c("val1", "val2"))                  # some values, like activation functions
When experimenting with the configuration:
Range: wide → narrow
Step: big → small
Generally, I wonder if there is a way to tell whether the seemingly unchanged accuracy after tweaking my recipe with steps such as step_normalize() was due to the minimal impact of that step itself or to my incorrect implementation of the step in the recipe. Also, I wonder how to properly implement step_pca() while retaining all features; I failed to find many useful resources on this.
Some steps do not strongly change the model's behavior. Regarding normalization in this unit: aligning the scales of the features can help gradient descent, but if the initialization of the parameter search happens to be favorable, training may go well even without correct scaling.
PCA part: we may need more information. What does it mean to implement PCA while keeping all features?
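If the intent is to rotate the features without discarding any variance (an assumption about the question, not a course recommendation), one possibility is the threshold argument of step_pca():

library(recipes)
rec <- recipe(mpg ~ ., data = mtcars) |>
  step_normalize(all_numeric_predictors()) |>
  step_pca(all_numeric_predictors(), threshold = 1)  # keep enough components for 100% of the variance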
Neural network
Multi-layer usage
I would like to focus on writing functions for Keras; I’m not that familiar with this tool in general.
Can you do a demonstration of how to fit a neural network with multiple hidden layers?
How can we deal with more than one hidden layer in mlp()? (See the sketch below.)
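mlp() in parsnip specifies a single-hidden-layer network, so for more layers we drop down to keras directly. A minimal sketch (layer sizes and input_shape are illustrative):

library(keras)
model <- keras_model_sequential() |>
  layer_dense(units = 64, activation = "relu", input_shape = 20) |>  # hidden layer 1
  layer_dense(units = 32, activation = "relu") |>                    # hidden layer 2
  layer_dense(units = 2, activation = "softmax")                     # 2-class output
model |> compile(optimizer = "adam",
                 loss = "categorical_crossentropy",
                 metrics = "accuracy")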
ELU: f(x) = x if x > 0, else α*(exp(x) − 1) → smoother than ReLU for negative inputs
Softmax
Output layer (multi-class classification)
Converts raw scores z to probabilities that sum to 1: softmax(z)_i = exp(z_i) / Σ_j exp(z_j)
Demonstrations
library(ggplot2)
# ReLU function
relu <- function(x) pmax(0, x)
x <- seq(-5, 5, 0.1)
y_relu <- relu(x)
df_relu <- data.frame(x = x, y = y_relu)
ggplot(df_relu, aes(x = x, y = y)) +
  geom_line(color = "steelblue", linewidth = 1.2) +  # linewidth replaces the deprecated size aesthetic
  geom_hline(yintercept = 0, color = "gray40") +
  geom_vline(xintercept = 0, color = "gray40") +
  labs(title = "ReLU Activation Function", x = "x", y = "ReLU(x)") +
  theme_minimal(base_size = 14)
# ELU function
elu <- function(x, alpha = 1.0) {
  ifelse(x > 0, x, alpha * (exp(x) - 1))
}
y_elu <- elu(x)
df_elu <- data.frame(x = x, y = y_elu)
ggplot(df_elu, aes(x = x, y = y)) +
  geom_line(color = "darkorange", linewidth = 1.2) +
  geom_hline(yintercept = 0, color = "gray40") +
  geom_vline(xintercept = 0, color = "gray40") +
  labs(title = "ELU Activation Function (α = 1)", x = "x", y = "ELU(x)") +
  theme_minimal(base_size = 14)
library(tidyr)
# Binary softmax demo: softmax([-x, x])
softmax <- function(x) {
  exp_x <- exp(x - max(x))  # subtract max(x) for numerical stability
  exp_x / sum(exp_x)
}
x_vals <- seq(-4, 4, by = 0.1)
# Compute probabilities for both classes
prob_class1 <- sapply(x_vals, function(x) softmax(c(-x, x))[1])
prob_class2 <- 1 - prob_class1  # since binary
# Organize into a tidy data frame
df <- data.frame(x = x_vals,
                 class_1 = prob_class1,
                 class_2 = prob_class2) |>
  pivot_longer(cols = c("class_1", "class_2"),
               names_to = "Class", values_to = "Probability")
# Plot
ggplot(df, aes(x = x, y = Probability, color = Class)) +
  geom_line(linewidth = 1.2) +
  geom_hline(yintercept = 0, color = "gray40") +
  geom_vline(xintercept = 0, color = "gray40") +
  labs(title = "Softmax Output (2-Class Example)",
       x = "x",
       y = "Probability") +
  scale_color_manual(values = c("lightgreen", "darkgreen")) +
  theme_minimal(base_size = 14)
Model comparison
Are neural networks more likely to have zero variance predictors than other machine learning methods?
Zero-variance predictors are features that have an identical value for all samples. This is a property of the predictors themselves, not of the algorithm learning from them, so neural networks are no more prone to them than other methods.
General
What are the trade-offs between efficiency and finding the best model?
Experiment, taking into account your limitations and your expectations for the whole training process.
Applying techniques to reduce overfitting to the data
This is very general … regularization and controlling the complexity of the model configuration are the main levers.
This is probably a beginning-of-semester topic, but I see we are creating feat_trn like feat_trn <- rec_prep |> bake(NULL). When do we bake with NULL (nothing) versus with data_trn?
Same.
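For reference (a hedged note; the data names are illustrative): when a recipe is prepped with retain = TRUE (the default), bake(rec_prep, new_data = NULL) simply returns the already-processed training data without recomputing it, while passing a data frame applies the trained steps to that data:

library(recipes)
rec_prep <- recipe(mpg ~ ., data = mtcars) |>
  step_normalize(all_numeric_predictors()) |>
  prep()
feat_trn <- bake(rec_prep, new_data = NULL)    # retained training features, no recompute
feat_new <- bake(rec_prep, new_data = mtcars)  # applies the trained steps to (possibly new) data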
Not completely related, but it would be great if you could explain how to check whether parallel processing is working properly.
Maybe set up a tiny loop experiment to compare time consumption using the tictoc package, as sketched below.
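A minimal timing experiment (the sleep time and worker count are illustrative):

library(tictoc)
library(doParallel)  # also loads foreach and parallel

slow_task <- function(i) { Sys.sleep(0.5); i }

tic("sequential")
res_seq <- foreach(i = 1:8) %do% slow_task(i)
toc()  # roughly 4 s

cl <- makePSOCKcluster(4)
registerDoParallel(cl)
tic("parallel")
res_par <- foreach(i = 1:8) %dopar% slow_task(i)
toc()  # roughly 1 s if the 4 workers are actually being used
stopCluster(cl)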
GPU
Is it possible to use GPUs in R for Keras rather than CPUs, since they could speed up parallel processing?
If you installed a GPU-enabled build, TensorFlow should detect the GPU automatically. It could be installed with
pip install tensorflow-gpu
(note that recent TensorFlow releases bundle GPU support into the standard tensorflow package), or you can configure TF to use the GPU.
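To check from R whether TensorFlow can see a GPU (a minimal sketch via the tensorflow package's handle to the TF Python API):

library(tensorflow)
tf$config$list_physical_devices("GPU")  # an empty list means no GPU is visible to TF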
I tried to get keras working with my GPU and have yet to succeed, but one of the things I stumbled into was using WSL (Windows Subsystem for Linux). I've dabbled a little bit with it, but hearing more about it (or having an appendix that talks about its benefits) would be nice.
Generally, we hope everyone will try out some search strategies for resources beyond the scope of this course. We can use them along a granularity scale, to some extent:
LLMs (generally, ask for assistance on "what/how … is/for" questions, or when you hit a practical bug in your code; sometimes the issue will be solved even at this step)
Stack Overflow, GitHub, various forums (specific questions people have brought up similarly)
Package docs, online books, tutorials (people have tried hard to compile usable manuals to help their users through …)
And this process is cyclical: sometimes one solution to an issue leads to another search to solve a related/latent problem.
Troubleshooting
What are some common reasons why a neural network might not train effectively? How can you use debugging techniques like printing intermediate outputs or checking gradient norms to troubleshoot the problem?
It is a very general concern in training. Performance can be affected by many things, from the quality of the input to the configuration details and the properties of the regularization. Focusing on gradient descent, TensorFlow has existing machinery for gradient (norm) checking. See tf.GradientTape (https://www.tensorflow.org/api_docs/python/tf/GradientTape) for more information.
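A minimal sketch of inspecting a gradient from R via the tensorflow package (the toy function y = x*x is illustrative; in practice you would check the norms of your layer gradients for vanishing/exploding values):

library(tensorflow)
x <- tf$Variable(3)
with(tf$GradientTape() %as% tape, {
  y <- x * x  # operations on x are recorded by the tape
})
g <- tape$gradient(y, x)  # dy/dx = 2x
as.numeric(g)             # 6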