library(tidyverse)
library(tidymodels)
devtools::source_url("https://github.com/jjcurtin/lab_support/blob/main/fun_eda.R?raw=true")
17 Novel Levels in Held-Out Set(s)
When nominal/ordinal predictors have infrequent levels, you will occasionally find that an infrequent level appears in your held-out set (i.e., validation or test) but not in your training set. This can cause problems when you try to make predictions for these new values. Specifically, the feature values for this level will be set to NA, and you will therefore get predictions of NA for these observations.
In this appendix, we demonstrate this problem and our preferred solution given our workflow of classing all nominal/ordinal predictors as factors in our dataframes.
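Before fitting anything, it can help to check directly whether a held-out set contains values that never occur in training. The following is a minimal base R sketch of such a check. The helper name find_novel_levels is hypothetical (it is not part of any package or of this appendix's workflow), and the two small factors are made up for illustration.

```r
# Hypothetical helper (an assumption for illustration, not from any package):
# return values that occur in held-out data but never in training data.
find_novel_levels <- function(trn, held_out) {
  setdiff(unique(as.character(held_out)), unique(as.character(trn)))
}

# Made-up example data: "foo" occurs only in the held-out factor
x_trn  <- factor(c("a", "b", "c", "a", "b", "c"))
x_test <- factor(c("a", "b", "c", "a", "b", "c", "foo"))

novel <- find_novel_levels(x_trn, x_test)
print(novel)
```

The comparison uses observed values rather than levels(), so a level that is declared in training but has no training observations would still be reported as novel.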
Make simple data sets with an outcome (y) and one nominal predictor (x). Note that x will have a novel value (foo) in the test set that wasn't present in the training set.
n <- 6
data_trn <- tibble(y = rnorm(n),
                   x = rep(c("a", "b", "c"), n/3)) |>
  mutate(x = factor(x, levels = c("a", "b", "c"))) |>
  print()
# A tibble: 6 × 2
y x
<dbl> <fct>
1 1.03 a
2 -0.478 b
3 0.772 c
4 0.948 a
5 0.214 b
6 0.333 c
data_test <- tibble(y = c(rnorm(n), rnorm(1)),
                    x = c(rep(c("a", "b", "c"), n/3), "foo")) |>
  mutate(x = factor(x, levels = c("a", "b", "c", "foo"))) |>
  print()
# A tibble: 7 × 2
y x
<dbl> <fct>
1 -0.0664 a
2 -0.520 b
3 -0.994 c
4 -0.980 a
5 -0.191 b
6 0.125 c
7 0.783 foo
Make a recipe
rec <- recipe(y ~ x, data = data_trn) %>%
  step_dummy(x)
Prep the recipe with training data
rec_prep <- rec |>
  prep(data_trn)
Features for training set. No problems
feat_trn <- rec_prep |>
  bake(data_trn)
feat_trn |> skim_all()
Name | feat_trn |
Number of rows | 6 |
Number of columns | 3 |
_______________________ | |
Column type frequency: | |
numeric | 3 |
________________________ | |
Group variables | None |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | skew | kurtosis |
---|---|---|---|---|---|---|---|---|---|---|---|
y | 0 | 1 | 0.47 | 0.57 | -0.48 | 0.24 | 0.55 | 0.90 | 1.03 | -0.51 | -1.45 |
x_b | 0 | 1 | 0.33 | 0.52 | 0.00 | 0.00 | 0.00 | 0.75 | 1.00 | 0.54 | -1.96 |
x_c | 0 | 1 | 0.33 | 0.52 | 0.00 | 0.00 | 0.00 | 0.75 | 1.00 | 0.54 | -1.96 |
Features for test set.
- Now we see the problem, indicated by the warning about a new level in test.
- We see that one observation is missing for the x features (x_b and x_c) in test. If we looked closer, we would see this is the observation for foo.
feat_test <- rec_prep |>
  bake(data_test)
Warning: ! There are new levels in a factor: `foo`.
feat_test |> skim_all()
Name | feat_test |
Number of rows | 7 |
Number of columns | 3 |
_______________________ | |
Column type frequency: | |
numeric | 3 |
________________________ | |
Group variables | None |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | skew | kurtosis |
---|---|---|---|---|---|---|---|---|---|---|---|
y | 0 | 1.00 | -0.26 | 0.63 | -0.99 | -0.75 | -0.19 | 0.03 | 0.78 | 0.25 | -1.42 |
x_b | 1 | 0.86 | 0.33 | 0.52 | 0.00 | 0.00 | 0.00 | 0.75 | 1.00 | 0.54 | -1.96 |
x_c | 1 | 0.86 | 0.33 | 0.52 | 0.00 | 0.00 | 0.00 | 0.75 | 1.00 | 0.54 | -1.96 |
We solve this problem by making sure this level was listed when we created the factor in training (e.g., use this mutate earlier when classing x in data_trn: mutate(x = factor(x, levels = c("a", "b", "c", "foo")))).
Or we can add the level after the fact, when we discover the problem (as below).
data_trn1 <- data_trn |>
  mutate(x = factor(x, levels = c("a", "b", "c", "foo")))
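With many predictors or many levels, typing the full level list by hand becomes error-prone. As a rough sketch, and assuming the held-out factor already declares every level of interest, the level set can be built with union() and applied to the training factor. The vectors and the names all_levels and x_trn1 below are made up for illustration.

```r
# Made-up example factors; the held-out factor declares the extra level "foo"
x_trn  <- factor(c("a", "b", "c", "a", "b", "c"),
                 levels = c("a", "b", "c"))
x_test <- factor(c("a", "b", "c", "a", "b", "c", "foo"),
                 levels = c("a", "b", "c", "foo"))

# union() keeps the training levels first and in their original order,
# then appends any levels found only in the held-out factor
all_levels <- union(levels(x_trn), levels(x_test))

# Re-class the training factor with the expanded level set;
# the data values themselves are unchanged
x_trn1 <- factor(x_trn, levels = all_levels)

print(levels(x_trn1))
print(table(x_trn1))
```

Because the training levels stay first, the reference level used for dummy coding does not change. The added level simply has a count of zero in training.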
Now prep the recipe with this updated training set that includes the foo level
rec_prep1 <- rec |>
  prep(data_trn1)
Features for training as before
- We now have a feature for this new level
- It is set to 0 for all observations (because there are no observations with a value of foo in the training set)
feat_trn1 <- rec_prep1 |>
  bake(data_trn1)
feat_trn1 |> skim_all()
Name | feat_trn1 |
Number of rows | 6 |
Number of columns | 4 |
_______________________ | |
Column type frequency: | |
numeric | 4 |
________________________ | |
Group variables | None |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | skew | kurtosis |
---|---|---|---|---|---|---|---|---|---|---|---|
y | 0 | 1 | 0.47 | 0.57 | -0.48 | 0.24 | 0.55 | 0.90 | 1.03 | -0.51 | -1.45 |
x_b | 0 | 1 | 0.33 | 0.52 | 0.00 | 0.00 | 0.00 | 0.75 | 1.00 | 0.54 | -1.96 |
x_c | 0 | 1 | 0.33 | 0.52 | 0.00 | 0.00 | 0.00 | 0.75 | 1.00 | 0.54 | -1.96 |
x_foo | 0 | 1 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | NaN | NaN |
Now there is no problem when we find this value for an observation in the test set.
feat_test1 <- rec_prep1 |>
  bake(data_test)
feat_test1 |> skim_all()
Name | feat_test1 |
Number of rows | 7 |
Number of columns | 4 |
_______________________ | |
Column type frequency: | |
numeric | 4 |
________________________ | |
Group variables | None |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | skew | kurtosis |
---|---|---|---|---|---|---|---|---|---|---|---|
y | 0 | 1 | -0.26 | 0.63 | -0.99 | -0.75 | -0.19 | 0.03 | 0.78 | 0.25 | -1.42 |
x_b | 0 | 1 | 0.29 | 0.49 | 0.00 | 0.00 | 0.00 | 0.50 | 1.00 | 0.75 | -1.60 |
x_c | 0 | 1 | 0.29 | 0.49 | 0.00 | 0.00 | 0.00 | 0.50 | 1.00 | 0.75 | -1.60 |
x_foo | 0 | 1 | 0.14 | 0.38 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | 1.62 | 0.80 |
All is good. BUT, there are still some complexities when we fit this model in train and predict into test. In training, the x_foo feature is a constant (all 0), so this will present issues for some statistical algorithms. Let's see what happens when we fit a linear model and use it to predict into test.
fit1 <-
  linear_reg() %>%
  set_engine("lm") %>%
  fit(y ~ ., data = feat_trn1)
If we look at the parameter estimates, we see that the algorithm was unable to estimate a parameter for x_foo because it was a constant in train
fit1 %>% tidy()
# A tibble: 4 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 0.989 0.238 4.16 0.0253
2 x_b -1.12 0.336 -3.33 0.0446
3 x_c -0.436 0.336 -1.30 0.285
4 x_foo NA NA NA NA
This will generate a warning about prediction from a rank-deficient fit when you use this model to make predictions for values it didn't see in the training set.
- The consequence is that the model will predict a y value associated with the reference level (coded 0 for all other dummy features) for all foo observations. This is probably the best we can do for these new (previously unseen) values for x.
- Also note that the column name for predictions, which is usually called .pred, may instead be called .pred_res. If so, you will need to accommodate this in your code as well. Just rename it.
predict(fit1, feat_test1) |>
bind_cols(feat_test1)
Warning in predict.lm(object = object$fit, newdata = new_data, type =
"response", : prediction from rank-deficient fit; consider predict(.,
rankdeficient="NA")
# A tibble: 7 × 5
.pred y x_b x_c x_foo
<dbl> <dbl> <dbl> <dbl> <dbl>
1 0.989 -0.0664 0 0 0
2 -0.132 -0.520 1 0 0
3 0.553 -0.994 0 1 0
4 0.989 -0.980 0 0 0
5 -0.132 -0.191 1 0 0
6 0.553 0.125 0 1 0
7 0.989 0.783 0 0 1
This is our preferred solution when new/previously unseen values exist in held-out data. A comparable solution is offered as a recipe step. See step_novel().
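For comparison, here is a minimal sketch of the step_novel() route. It is an illustration under stated assumptions rather than the workflow used in this appendix: the data are regenerated with a seed, the object names rec_novel, rec_novel_prep, and feat_test_novel are made up, and the new level is left at the step's default label ("new"), so the resulting dummy feature is expected to be named x_new rather than x_foo.

```r
library(recipes)

# Regenerated toy data (assumption: the same structure as the example above)
set.seed(123)
data_trn <- data.frame(y = rnorm(6),
                       x = factor(rep(c("a", "b", "c"), 2)))
data_test <- data.frame(y = rnorm(7),
                        x = factor(c(rep(c("a", "b", "c"), 2), "foo")))

# step_novel() must come before step_dummy() so that the extra level
# exists when the dummy features are defined during prep()
rec_novel <- recipe(y ~ x, data = data_trn) |>
  step_novel(x) |>
  step_dummy(x)

rec_novel_prep <- prep(rec_novel, training = data_trn)

# Values not seen in training are recoded to the novel level when baking
feat_test_novel <- bake(rec_novel_prep, new_data = data_test)
print(feat_test_novel)
```

As with the manual approach, the feature for the novel level is constant (all 0) in training, so the same rank-deficiency considerations apply when fitting a linear model.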