library(tidyverse)
library(tidymodels)
devtools::source_url("https://github.com/jjcurtin/lab_support/blob/main/fun_eda.R?raw=true") # sources helper functions (e.g., skim_all()) used below

When you have nominal/ordinal predictors with infrequent levels, you will occasionally find that an infrequent level appears in your held-out set (i.e., validation or test) but not in your training set. This can cause problems when you try to make predictions for these new values. Specifically, the feature values for this level will be set to NA, and therefore you will get predictions of NA for these observations.
In this appendix, we demonstrate this problem and our preferred solution given our workflow of classing all nominal/ordinal predictors as factors in our dataframes.
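To see why, here is a minimal base-R sketch (an illustration of the mechanism, not part of this appendix's workflow): a value that is not among a factor's declared levels is coerced to NA, and any features derived from that factor inherit the NA.

```r
# Hypothetical illustration: "foo" is not among the declared levels, so it
# becomes NA when coerced to the factor
factor("foo", levels = c("a", "b", "c"))
#> [1] <NA>
#> Levels: a b c
```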
Make simple data sets with an outcome (y) and one nominal predictor (x). Note that x will have a novel value (foo) in the test set that wasn't present in the training set.
n <- 6
data_trn <- tibble(y = rnorm(n),
                   x = rep(c("a", "b", "c"), n/3)) |>
  mutate(x = factor(x, levels = c("a", "b", "c"))) |>
  print()

# A tibble: 6 × 2
y x
<dbl> <fct>
1 1.16 a
2 -1.01 b
3 0.929 c
4 -0.494 a
5 0.0734 b
6 -0.580 c
data_test <- tibble(y = c(rnorm(n), rnorm(1)),
                    x = c(rep(c("a", "b", "c"), n/3), "foo")) |>
  mutate(x = factor(x, levels = c("a", "b", "c", "foo"))) |>
  print()

# A tibble: 7 × 2
y x
<dbl> <fct>
1 0.836 a
2 0.412 b
3 -0.0199 c
4 -0.968 a
5 0.00296 b
6 -0.864 c
7 -0.306 foo
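Before any recipe is involved, you could screen for this issue by comparing the observed values of the predictor across splits. This is a hypothetical pre-check, not part of the original workflow:

```r
# Any value returned here appears in test but never in train
setdiff(unique(as.character(data_test$x)),
        unique(as.character(data_trn$x)))
#> [1] "foo"
```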
Make a recipe
rec <- recipe(y ~ x, data = data_trn) |>
  step_dummy(x)

Prep the recipe with the training data
rec_prep <- rec |>
  prep(data_trn)

Features for the training set. No problems here.
feat_trn <- rec_prep |>
bake(NULL)
feat_trn |> skim_all()

| Name                   | feat_trn |
|------------------------|----------|
| Number of rows         | 6        |
| Number of columns      | 3        |
| Column type frequency: |          |
| numeric                | 3        |
| Group variables        | None     |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | skew | kurtosis |
|---|---|---|---|---|---|---|---|---|---|---|---|
| y | 0 | 1 | 0.01 | 0.88 | -1.01 | -0.56 | -0.21 | 0.71 | 1.16 | 0.22 | -1.93 |
| x_b | 0 | 1 | 0.33 | 0.52 | 0.00 | 0.00 | 0.00 | 0.75 | 1.00 | 0.54 | -1.96 |
| x_c | 0 | 1 | 0.33 | 0.52 | 0.00 | 0.00 | 0.00 | 0.75 | 1.00 | 0.54 | -1.96 |
Features for the test set. Note the warning about new levels below, and that the summary shows one missing value (n_missing = 1) for each feature derived from x in test. If we looked closer, we would see this is the observation for foo.

feat_test <- rec_prep |>
  bake(data_test)

Warning: ! There are new levels in `x`: foo.
ℹ Consider using step_novel() (`?recipes::step_novel()`) before
  `step_dummy()` to handle unseen values.
feat_test |> skim_all()

| Name                   | feat_test |
|------------------------|-----------|
| Number of rows         | 7         |
| Number of columns      | 3         |
| Column type frequency: |           |
| numeric                | 3         |
| Group variables        | None      |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | skew | kurtosis |
|---|---|---|---|---|---|---|---|---|---|---|---|
| y | 0 | 1.00 | -0.13 | 0.65 | -0.97 | -0.59 | -0.02 | 0.21 | 0.84 | 0.04 | -1.60 |
| x_b | 1 | 0.86 | 0.33 | 0.52 | 0.00 | 0.00 | 0.00 | 0.75 | 1.00 | 0.54 | -1.96 |
| x_c | 1 | 0.86 | 0.33 | 0.52 | 0.00 | 0.00 | 0.00 | 0.75 | 1.00 | 0.54 | -1.96 |
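To look closer ourselves (a hypothetical check, not in the original output), we can bind the raw predictor back onto the baked features and confirm that the row with missing dummy features is the foo observation:

```r
feat_test |>
  bind_cols(data_test |> select(x)) |>  # rows are in the same order
  filter(is.na(x_b))                    # the only row with NA features has x == "foo"
```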
We handle this problem of potential new levels in held-out data by inserting step_novel() prior to step_dummy() in our recipe. This assigns all novel (i.e., unseen in training) levels to a new category called "new" by default.
rec_novel <- recipe(y ~ x, data = data_trn) |>
  step_novel(x) |>
  step_dummy(x)

When we now prep this recipe using training data that does not contain foo (the novel level we will find in test), everything is fine.
rec_novel_prep <- rec_novel |>
  prep(data_trn)

When we bake features for the training data, we see what step_novel() did. It added a new level, and therefore a new feature to code the contrast of that level with the reference level. However, given that this new level was not present in our training data, all observations are assigned a zero for this new feature.
feat_trn_novel <- rec_novel_prep |>
bake(NULL)
feat_trn_novel |> bind_cols(data_trn |> select(x)) |> print()

# A tibble: 6 × 5
y x_b x_c x_new x
<dbl> <dbl> <dbl> <dbl> <fct>
1 1.16 0 0 0 a
2 -1.01 1 0 0 b
3 0.929 0 1 0 c
4 -0.494 0 0 0 a
5 0.0734 1 0 0 b
6 -0.580 0 1 0 c
But now when we bake the test data, this new feature is set to 1 for the observation associated with the new level foo.
feat_test_novel <- rec_novel_prep |>
bake(data_test)
feat_test_novel |> bind_cols(data_test |> select(x)) |> print()

# A tibble: 7 × 5
y x_b x_c x_new x
<dbl> <dbl> <dbl> <dbl> <fct>
1 0.836 0 0 0 a
2 0.412 1 0 0 b
3 -0.0199 0 1 0 c
4 -0.968 0 0 0 a
5 0.00296 1 0 0 b
6 -0.864 0 1 0 c
7 -0.306 0 0 1 foo
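As an aside, the label for this catch-all level is controlled by the new_level argument to step_novel() (the default is "new"). A sketch of a hypothetical variant, useful if "new" could collide with a real level of x:

```r
# Hypothetical variant: label novel levels "unseen" instead of "new",
# which would produce a dummy feature named x_unseen rather than x_new
rec_unseen <- recipe(y ~ x, data = data_trn) |>
  step_novel(x, new_level = "unseen") |>
  step_dummy(x)
```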
All looks normal when we fit this model to our training features.

fit_novel <-
  linear_reg() |>
  set_engine("lm") |>
  fit(y ~ ., data = feat_trn_novel)

However, if we look at the parameter estimates, we see that the algorithm was unable to estimate a parameter for x_new because it was a constant in train. Of course, this makes sense: there were no observations of foo in training, so the model couldn't learn how that new level differed from the reference level.
fit_novel |> tidy()

# A tibble: 4 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 0.335 0.719 0.466 0.673
2 x_b -0.806 1.02 -0.792 0.486
3 x_c -0.161 1.02 -0.158 0.885
4 x_new NA NA NA NA
This model will now generate a warning ("prediction from rank-deficient fit") when you use it to make predictions for values it didn't see in the training set. However, it will still return predictions for all observations. Notice that it predicts the y value associated with the reference level (coded 0 for all other dummy features) for all foo observations. This is probably the best we can do for these new (previously unseen) values of x.

predict(fit_novel, feat_test_novel) |>
  bind_cols(feat_test_novel)

Warning in predict.lm(object = object$fit, newdata = new_data, type = "response", :
  prediction from rank-deficient fit; consider predict(., rankdeficient="NA")
# A tibble: 7 × 5
.pred y x_b x_c x_new
<dbl> <dbl> <dbl> <dbl> <dbl>
1 0.335 0.836 0 0 0
2 -0.471 0.412 1 0 0
3 0.174 -0.0199 0 1 0
4 0.335 -0.968 0 0 0
5 -0.471 0.00296 1 0 0
6 0.174 -0.864 0 1 0
7 0.335 -0.306 0 0 1
You do not always need to use step_novel(). Just add it to a recipe if you find that there are novel levels in your held-out data (and re-prep the recipe after you add that step, of course).
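To make the re-prep point concrete, here is a sketch (using the objects defined above) of adding the step to the unprepped recipe specification and prepping again:

```r
# Re-specify the recipe with step_novel() and re-prep it on the training data
rec_fixed <- recipe(y ~ x, data = data_trn) |>
  step_novel(x) |>
  step_dummy(x) |>
  prep(data_trn)

# Baking the test set now produces no NA features for the foo observation
rec_fixed |> bake(data_test)
```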