Novel Levels in Held-Out Set(s)

When nominal/ordinal predictors have infrequent levels, you will occasionally find that one of those levels appears in your held-out set (i.e., validation or test) but not in your training set. This causes problems when you try to make predictions for these new values. Specifically, the features derived from that predictor will be set to NA for those observations, and therefore you will get predictions of NA for them.

In this appendix, we demonstrate this problem and our preferred solution given our workflow of classing all nominal/ordinal predictors as factors in our dataframes.

Code
library(tidyverse) 
library(tidymodels) 
devtools::source_url("https://github.com/jjcurtin/lab_support/blob/main/fun_eda.R?raw=true")

Make simple data sets with an outcome (y) and one nominal predictor (x). Note that x will have a novel value (foo) in the test set that wasn't present in the training set.

Code
n <- 6
data_trn <- tibble(y = rnorm(n), 
                   x = rep (c("a", "b", "c"), n/3)) |>
  mutate(x = factor(x, levels = c("a", "b", "c"))) |> 
  print()
# A tibble: 6 × 2
        y x    
    <dbl> <fct>
1  1.16   a    
2 -1.01   b    
3  0.929  c    
4 -0.494  a    
5  0.0734 b    
6 -0.580  c    
Code
data_test <- tibble(y = c(rnorm(n), rnorm(1)),
                    x = c(rep (c("a", "b", "c"), n/3), "foo")) |> 
  mutate(x = factor(x, levels = c("a", "b", "c", "foo"))) |> 
  print()
# A tibble: 7 × 2
         y x    
     <dbl> <fct>
1  0.836   a    
2  0.412   b    
3 -0.0199  c    
4 -0.968   a    
5  0.00296 b    
6 -0.864   c    
7 -0.306   foo  
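Before building features, it can help to check directly whether a held-out set contains levels that never occur in training. The sketch below is one minimal way to do that in base R. The helper name find_novel_levels() and the toy data frames are hypothetical (they are not part of our lab functions), and the helper compares observed values rather than declared factor levels.

```r
# Hypothetical toy data mirroring the training and test sets above
data_trn  <- data.frame(y = rnorm(6),
                        x = factor(rep(c("a", "b", "c"), 2)))
data_test <- data.frame(y = rnorm(7),
                        x = factor(c(rep(c("a", "b", "c"), 2), "foo")))

# Hypothetical helper: for every factor/character column in trn, return the
# observed values in held_out that never occur in trn
find_novel_levels <- function(trn, held_out) {
  is_nominal <- vapply(trn,
                       function(col) is.factor(col) || is.character(col),
                       logical(1))
  nominal <- names(trn)[is_nominal]
  novel <- lapply(nominal, function(v) {
    seen    <- unique(as.character(trn[[v]]))
    present <- unique(as.character(held_out[[v]]))
    setdiff(present[!is.na(present)], seen)
  })
  names(novel) <- nominal
  novel[lengths(novel) > 0]  # keep only predictors that have novel values
}

novel <- find_novel_levels(data_trn, data_test)
novel
```

For these data, the returned list has a single element, x, containing "foo". An empty list means no novel values were found.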

Make a recipe

Code
rec <- recipe(y ~ x, data = data_trn) %>% 
  step_dummy(x)

Prep the recipe with training data

Code
rec_prep <- rec |> 
  prep(data_trn)

Features for the training set. No problems.

Code
feat_trn <- rec_prep |> 
  bake(NULL)

feat_trn |> skim_all()
Data summary
Name feat_trn
Number of rows 6
Number of columns 3
_______________________
Column type frequency:
numeric 3
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 skew kurtosis
y 0 1 0.01 0.88 -1.01 -0.56 -0.21 0.71 1.16 0.22 -1.93
x_b 0 1 0.33 0.52 0.00 0.00 0.00 0.75 1.00 0.54 -1.96
x_c 0 1 0.33 0.52 0.00 0.00 0.00 0.75 1.00 0.54 -1.96

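Features for the test set. Baking now warns about the new level, and both x_b and x_c are missing (NA) for the one foo observation.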

Code
feat_test <- rec_prep |> 
  bake(data_test)
Warning: ! There are new levels in `x`: foo.
ℹ Consider using step_novel() (`?recipes::step_novel()`) before
  `step_dummy()` to handle unseen values.
Code
feat_test |> skim_all()
Data summary
Name feat_test
Number of rows 7
Number of columns 3
_______________________
Column type frequency:
numeric 3
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 skew kurtosis
y 0 1.00 -0.13 0.65 -0.97 -0.59 -0.02 0.21 0.84 0.04 -1.60
x_b 1 0.86 0.33 0.52 0.00 0.00 0.00 0.75 1.00 0.54 -1.96
x_c 1 0.86 0.33 0.52 0.00 0.00 0.00 0.75 1.00 0.54 -1.96
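To see how those missing features carry through to predictions, the following sketch fits a linear model to the training features and predicts into the test features. It assumes the same recipe as above but recreates the data with an arbitrary seed, so the numbers will differ from the output shown in this appendix. The object names fit_lm and preds are only for illustration.

```r
library(tidymodels)

set.seed(123)  # arbitrary seed so the sketch is reproducible
data_trn  <- tibble(y = rnorm(6),
                    x = factor(rep(c("a", "b", "c"), 2)))
data_test <- tibble(y = rnorm(7),
                    x = factor(c(rep(c("a", "b", "c"), 2), "foo")))

# Recipe WITHOUT step_novel(), prepped on the training data
rec_prep <- recipe(y ~ x, data = data_trn) |>
  step_dummy(x) |>
  prep(data_trn)

feat_trn  <- bake(rec_prep, NULL)
# suppressWarnings() only silences the new-level warning shown above
feat_test <- suppressWarnings(bake(rec_prep, data_test))

fit_lm <- linear_reg() |>
  set_engine("lm") |>
  fit(y ~ ., data = feat_trn)

# The foo row has NA for x_b and x_c, so its prediction should also be NA
preds <- predict(fit_lm, feat_test)
preds |> bind_cols(data_test |> select(x))
```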

We handle this problem of potential new levels in held-out data by inserting step_novel() prior to step_dummy() in our recipe. This assigns all potential novel (unseen in training) levels to a new category, called new by default.

Code
rec_novel <- recipe(y ~ x, data = data_trn) |>  
  step_novel(x) |> 
  step_dummy(x)

When we now prep this recipe using training data that does not contain foo (the novel level that appears in test), everything is fine.

Code
rec_novel_prep <- rec_novel |> 
  prep(data_trn)

When we bake features for training data, we see what step_novel() did. It added a new level and therefore a new feature to code the contrast of that level with the reference level. However, given that this new level was not present in our training data, all observations are assigned a zero for this new feature.

Code
feat_trn_novel <- rec_novel_prep |> 
  bake(NULL)

feat_trn_novel |> bind_cols(data_trn |> select(x)) |> print()
# A tibble: 6 × 5
        y   x_b   x_c x_new x    
    <dbl> <dbl> <dbl> <dbl> <fct>
1  1.16       0     0     0 a    
2 -1.01       1     0     0 b    
3  0.929      0     1     0 c    
4 -0.494      0     0     0 a    
5  0.0734     1     0     0 b    
6 -0.580      0     1     0 c    

But now when we bake the test data, this new feature is set to 1 for observations associated with the new level foo.

Code
feat_test_novel <- rec_novel_prep |> 
  bake(data_test)

feat_test_novel |> bind_cols(data_test |> select(x)) |> print()
# A tibble: 7 × 5
         y   x_b   x_c x_new x    
     <dbl> <dbl> <dbl> <dbl> <fct>
1  0.836       0     0     0 a    
2  0.412       1     0     0 b    
3 -0.0199      0     1     0 c    
4 -0.968       0     0     0 a    
5  0.00296     1     0     0 b    
6 -0.864       0     1     0 c    
7 -0.306       0     0     1 foo  
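It can also be useful to look at what step_novel() does to the factor itself, before any dummy coding. The sketch below uses a recipe with only step_novel() and inspects the baked x. The data are recreated here so the block stands alone, and new_level = "new" is written out explicitly only to show where the default label comes from.

```r
library(tidymodels)

data_trn  <- tibble(y = rnorm(6),
                    x = factor(rep(c("a", "b", "c"), 2)))
data_test <- tibble(y = rnorm(7),
                    x = factor(c(rep(c("a", "b", "c"), 2), "foo")))

# Only step_novel(), so that x remains a factor after baking
rec_only_novel <- recipe(y ~ x, data = data_trn) |>
  step_novel(x, new_level = "new") |>  # "new" is the default label
  prep(data_trn)

# Training data: the extra level exists but no observation uses it
lvls_trn <- levels(bake(rec_only_novel, NULL)$x)
lvls_trn

# Test data: foo has been recoded to the new level
x_test <- as.character(bake(rec_only_novel, data_test)$x)
x_test
```

Because the extra level is already part of the factor at prep time, step_dummy() can create a column for it even though no training observation has that value.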

All looks normal when we fit this model to our training features.

Code
fit_novel <-
  linear_reg() %>% 
  set_engine("lm") %>% 
  fit(y ~ ., data = feat_trn_novel)

However, if we look at the parameter estimates, we see that the algorithm was unable to estimate a parameter for x_new because it was a constant in train. Of course, this makes sense because there were no observations of foo in training, so the model couldn't learn how that new level differed from the reference level.

Code
fit_novel %>% tidy()
# A tibble: 4 × 5
  term        estimate std.error statistic p.value
  <chr>          <dbl>     <dbl>     <dbl>   <dbl>
1 (Intercept)    0.335     0.719     0.466   0.673
2 x_b           -0.806     1.02     -0.792   0.486
3 x_c           -0.161     1.02     -0.158   0.885
4 x_new         NA        NA        NA      NA    

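This model will now generate a warning (“prediction from rank-deficient fit”) when you use it to make predictions for values it didn't see in the training set. Note that the prediction for the foo observation (0.335) is simply the intercept, i.e., the same prediction made for the reference level a, because the model has no estimate for x_new.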

Code
predict(fit_novel, feat_test_novel) |>  
  bind_cols(feat_test_novel)
Warning in predict.lm(object = object$fit, newdata = new_data, type =
"response", : prediction from rank-deficient fit; consider predict(.,
rankdeficient="NA")
# A tibble: 7 × 5
   .pred        y   x_b   x_c x_new
   <dbl>    <dbl> <dbl> <dbl> <dbl>
1  0.335  0.836       0     0     0
2 -0.471  0.412       1     0     0
3  0.174 -0.0199      0     1     0
4  0.335 -0.968       0     0     0
5 -0.471  0.00296     1     0     0
6  0.174 -0.864       0     1     0
7  0.335 -0.306       0     0     1
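The behavior above can be reproduced with lm() alone, which makes it easier to see that the prediction for the novel level falls back to the intercept. This is a base R sketch with hand-built dummy features and an arbitrary seed. The data are hypothetical and will not match the estimates shown above.

```r
set.seed(123)  # arbitrary seed so the sketch is reproducible

# Hand-built training features: x_new is a constant 0, as in feat_trn_novel
feat <- data.frame(y     = rnorm(6),
                   x_b   = c(0, 1, 0, 0, 1, 0),
                   x_c   = c(0, 0, 1, 0, 0, 1),
                   x_new = 0)

fit <- lm(y ~ ., data = feat)
coef(fit)  # the coefficient for x_new is NA (it cannot be estimated)

# One observation from the novel level
new_obs <- data.frame(x_b = 0, x_c = 0, x_new = 1)

# suppressWarnings() silences the rank-deficient warning discussed above
pred <- suppressWarnings(predict(fit, new_obs))

# The prediction equals the intercept, i.e., the reference-level prediction
all.equal(unname(pred), unname(coef(fit)["(Intercept)"]))
```

In other words, observations with a novel level get a usable prediction, but it is only as good as treating them like the reference level.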

You do not always need to use step_novel(). Just add it to a recipe if you find that there are novel levels in your held-out data (and re-prep the recipe after you add that step, of course).
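If several nominal predictors turn out to have novel levels, one option is to apply the step to all of them with a selector rather than naming each variable. The sketch below assumes the all_nominal_predictors() selector from recipes. The second predictor (x2) and the object names are hypothetical and only there for illustration.

```r
library(tidymodels)

set.seed(123)  # arbitrary seed so the sketch is reproducible
data_trn <- tibble(y  = rnorm(6),
                   x1 = factor(rep(c("a", "b", "c"), 2)),
                   x2 = factor(rep(c("u", "v"), 3)))

# step_novel() must still come before step_dummy()
rec_all <- recipe(y ~ ., data = data_trn) |>
  step_novel(all_nominal_predictors()) |>
  step_dummy(all_nominal_predictors())

# Re-prep after changing the recipe, then bake as usual
feat_all <- rec_all |>
  prep(data_trn) |>
  bake(NULL)

names(feat_all)
```

Each nominal predictor now has its own feature for novel levels (here x1_new and x2_new), all zeros in training.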