When you have nominal/ordinal predictors with infrequent levels, you will occasionally find that an infrequent level appears in your held-out set (i.e., validation or test) but not in your training set. This can cause problems when you try to make predictions for these new values. Specifically, the feature values for this level will be set to NA and therefore, you will get predictions of NA for these observations.
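You can see the root of this in base R with a minimal sketch (separate from the recipes machinery we demonstrate below): a value that is not among a factor's declared levels is coerced to NA.
# Sketch: a value absent from the declared levels becomes NA
factor("foo", levels = c("a", "b", "c"))
#> [1] <NA>
#> Levels: a b c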
In this appendix, we demonstrate this problem and our preferred solution given our workflow of classing all nominal/ordinal predictors as factors in our dataframes.
library(tidyverse)
library(tidymodels)
devtools::source_url("https://github.com/jjcurtin/lab_support/blob/main/fun_eda.R?raw=true")
Make simple data sets with an outcome (y) and one nominal predictor (x). Note that x will have a novel value (foo) in the test set that wasn't present in the training set.
n <- 6

data_trn <- tibble(y = rnorm(n),
                   x = rep(c("a", "b", "c"), n/3)) |>
  mutate(x = factor(x, levels = c("a", "b", "c"))) |>
  print()
# A tibble: 6 × 2
y x
<dbl> <fct>
1 1.16 a
2 -1.01 b
3 0.929 c
4 -0.494 a
5 0.0734 b
6 -0.580 c
data_test <- tibble(y = c(rnorm(n), rnorm(1)),
                    x = c(rep(c("a", "b", "c"), n/3), "foo")) |>
  mutate(x = factor(x, levels = c("a", "b", "c", "foo"))) |>
  print()
# A tibble: 7 × 2
y x
<dbl> <fct>
1 0.836 a
2 0.412 b
3 -0.0199 c
4 -0.968 a
5 0.00296 b
6 -0.864 c
7 -0.306 foo
Make a recipe
rec <- recipe(y ~ x, data = data_trn) %>%
  step_dummy(x)
Prep the recipe with training data
rec_prep <- rec |>
  prep(data_trn)
Features for the training set. No problems here.
feat_trn <- rec_prep |>
  bake(NULL)

feat_trn |> skim_all()
Name | feat_trn |
Number of rows | 6 |
Number of columns | 3 |
_______________________ | |
Column type frequency: | |
numeric | 3 |
________________________ | |
Group variables | None |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | skew | kurtosis |
---|---|---|---|---|---|---|---|---|---|---|---|
y | 0 | 1 | 0.01 | 0.88 | -1.01 | -0.56 | -0.21 | 0.71 | 1.16 | 0.22 | -1.93 |
x_b | 0 | 1 | 0.33 | 0.52 | 0.00 | 0.00 | 0.00 | 0.75 | 1.00 | 0.54 | -1.96 |
x_c | 0 | 1 | 0.33 | 0.52 | 0.00 | 0.00 | 0.00 | 0.75 | 1.00 | 0.54 | -1.96 |
Features for the test set. Now we have problems. We get a warning about a new level, and the two features derived from x are missing (NA) for one observation in test. If we looked closer, we would see this is the observation for foo.
feat_test <- rec_prep |>
  bake(data_test)
Warning: ! There are new levels in `x`: foo.
ℹ Consider using step_novel() (`?recipes::step_novel()`) before
  `step_dummy()` to handle unseen values.
feat_test |> skim_all()
Name | feat_test |
Number of rows | 7 |
Number of columns | 3 |
_______________________ | |
Column type frequency: | |
numeric | 3 |
________________________ | |
Group variables | None |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | skew | kurtosis |
---|---|---|---|---|---|---|---|---|---|---|---|
y | 0 | 1.00 | -0.13 | 0.65 | -0.97 | -0.59 | -0.02 | 0.21 | 0.84 | 0.04 | -1.60 |
x_b | 1 | 0.86 | 0.33 | 0.52 | 0.00 | 0.00 | 0.00 | 0.75 | 1.00 | 0.54 | -1.96 |
x_c | 1 | 0.86 | 0.33 | 0.52 | 0.00 | 0.00 | 0.00 | 0.75 | 1.00 | 0.54 | -1.96 |
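If you want to look closer yourself, one quick sketch (using the same bind_cols() pattern we use later in this appendix) binds the raw predictor back onto the baked features and filters to the rows with missing values:
# Sketch: confirm that the row with NA features is the foo observation
feat_test |>
  bind_cols(data_test |> select(x)) |>
  filter(is.na(x_b))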
We handle this problem of potential new levels in held-out data by inserting step_novel() prior to step_dummy() in our recipe. This assigns all potential novel (unseen in training) levels to a new category called "new" by default.
rec_novel <- recipe(y ~ x, data = data_trn) |>
  step_novel(x) |>
  step_dummy(x)
When we now prep this recipe using training data that does not contain foo (the novel level we will find in test), everything is fine.
rec_novel_prep <- rec_novel |>
  prep(data_trn)
When we bake features for training data, we see what step_novel() did. It added a new level and therefore a new feature to code the contrast of that level with the reference level. However, given that this new level was not present in our training data, all observations are assigned a zero for this new feature.
feat_trn_novel <- rec_novel_prep |>
  bake(NULL)

feat_trn_novel |> bind_cols(data_trn |> select(x)) |> print()
# A tibble: 6 × 5
y x_b x_c x_new x
<dbl> <dbl> <dbl> <dbl> <fct>
1 1.16 0 0 0 a
2 -1.01 1 0 0 b
3 0.929 0 1 0 c
4 -0.494 0 0 0 a
5 0.0734 1 0 0 b
6 -0.580 0 1 0 c
But now when we bake the test data, this new feature is set to 1 for the observation associated with the novel level foo.
feat_test_novel <- rec_novel_prep |>
  bake(data_test)

feat_test_novel |> bind_cols(data_test |> select(x)) |> print()
# A tibble: 7 × 5
y x_b x_c x_new x
<dbl> <dbl> <dbl> <dbl> <fct>
1 0.836 0 0 0 a
2 0.412 1 0 0 b
3 -0.0199 0 1 0 c
4 -0.968 0 0 0 a
5 0.00296 1 0 0 b
6 -0.864 0 1 0 c
7 -0.306 0 0 1 foo
All looks normal when we fit this model to our training features.
fit_novel <-
  linear_reg() %>%
  set_engine("lm") %>%
  fit(y ~ ., data = feat_trn_novel)
However, if we look at the parameter estimates, we see that the algorithm was unable to estimate a parameter for x_new because that feature was constant in train. Of course, this makes sense because there were no observations of foo in training, so the model couldn't learn how that new level differs from the reference level.
fit_novel %>% tidy()
# A tibble: 4 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 0.335 0.719 0.466 0.673
2 x_b -0.806 1.02 -0.792 0.486
3 x_c -0.161 1.02 -0.158 0.885
4 x_new NA NA NA NA
This model will generate a warning about prediction from a rank-deficient fit when you use it to make predictions for values it didn't see in the training set. It will still return predictions, however: it predicts the y value associated with the reference level (coded 0 for all other dummy features) for all foo observations. This is probably the best we can do for these new (previously unseen) values for x.

predict(fit_novel, feat_test_novel) |>
  bind_cols(feat_test_novel)
Warning in predict.lm(object = object$fit, newdata = new_data, type =
"response", : prediction from rank-deficient fit; consider predict(.,
rankdeficient="NA")
# A tibble: 7 × 5
.pred y x_b x_c x_new
<dbl> <dbl> <dbl> <dbl> <dbl>
1 0.335 0.836 0 0 0
2 -0.471 0.412 1 0 0
3 0.174 -0.0199 0 1 0
4 0.335 -0.968 0 0 0
5 -0.471 0.00296 1 0 0
6 0.174 -0.864 0 1 0
7 0.335 -0.306 0 0 1
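If you would rather get NA than the reference-level prediction for these novel observations, the warning above points to one option. Here is a minimal sketch (assuming R >= 4.3.0; fit_novel$fit is the underlying lm object that parsnip stores):
# Sketch: request NA predictions for rank-deficient terms instead of the
# default reference-level prediction (predict.lm gained this argument in R 4.3.0)
predict(fit_novel$fit, newdata = feat_test_novel, rankdeficient = "NA")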
You do not always need to use step_novel(). Just put it into a recipe if you find that there are novel levels in your held-out data (and re-prep the recipe after you add that step, of course!).