Novel Levels in Held-Out Set(s)

When nominal or ordinal predictors have infrequent levels, you will occasionally find that one of those levels appears in your held-out set (i.e., validation or test) but not in your training set. This causes problems when you try to make predictions for these new values. Specifically, the dummy features for this level will be set to NA, and therefore you will get predictions of NA for these observations.

In this appendix, we demonstrate this problem and our preferred solution given our workflow of classing all nominal/ordinal predictors as factors in our dataframes.

Code

library(tidyverse) 
library(tidymodels) 
devtools::source_url("https://github.com/jjcurtin/lab_support/blob/main/fun_eda.R?raw=true")

Make simple data sets with an outcome (y) and one nominal predictor (x). Note that x will have a novel value (foo) in the test set that was not present in the training set.

Code

n <- 6
data_trn <- tibble(y = rnorm(n), 
                   x = rep (c("a", "b", "c"), n/3)) |>
  mutate(x = factor(x, levels = c("a", "b", "c"))) |> 
  print()

# A tibble: 6 × 2
        y x    
    <dbl> <fct>
1  1.16   a    
2 -1.01   b    
3  0.929  c    
4 -0.494  a    
5  0.0734 b    
6 -0.580  c

Code

data_test <- tibble(y = c(rnorm(n), rnorm(1)),
                    x = c(rep (c("a", "b", "c"), n/3), "foo")) |> 
  mutate(x = factor(x, levels = c("a", "b", "c", "foo"))) |> 
  print()

# A tibble: 7 × 2
         y x    
     <dbl> <fct>
1  0.836   a    
2  0.412   b    
3 -0.0199  c    
4 -0.968   a    
5  0.00296 b    
6 -0.864   c    
7 -0.306   foo

Make a recipe

Code

rec <- recipe(y ~ x, data = data_trn) %>% 
  step_dummy(x)

Prep the recipe with training data

Code

rec_prep <- rec |> 
  prep(data_trn)

Bake features for the training set. No problems here.

Code

feat_trn <- rec_prep |> 
  bake(NULL)

feat_trn |> skim_all()

Data summary
Name	feat_trn
Number of rows	6
Number of columns	3
_______________________
Column type frequency:
numeric	3
________________________
Group variables	None

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	skew	kurtosis
y	1	0.01	0.88	-1.01	-0.56	-0.21	0.71	1.16	0.22	-1.93
x_b	1	0.33	0.52	0.00	0.00	0.00	0.75	1.00	0.54	-1.96
x_c	1	0.33	0.52	0.00	0.00	0.00	0.75	1.00	0.54	-1.96

Now bake features for the test set.

Here we see the problem, flagged by the warning about a new level in test. The dummy features for x (x_b and x_c) are missing for one test observation. If we looked closer, we would see that this is the observation for foo.

Code

feat_test <- rec_prep |> 
  bake(data_test)

Warning: ! There are new levels in `x`: foo.
ℹ Consider using step_novel() (`?recipes::step_novel()`) \ before
  `step_dummy()` to handle unseen values.

Code

feat_test |> skim_all()

Data summary
Name	feat_test
Number of rows	7
Number of columns	3
_______________________
Column type frequency:
numeric	3
________________________
Group variables	None

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	skew	kurtosis
y	0	1.00	-0.13	0.65	-0.97	-0.59	-0.02	0.21	0.84	0.04	-1.60
x_b	1	0.86	0.33	0.52	0.00	0.00	0.00	0.75	1.00	0.54	-1.96
x_c	1	0.86	0.33	0.52	0.00	0.00	0.00	0.75	1.00	0.54	-1.96
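To see why the features go missing, a base R analogy may help. This is a minimal sketch, not the code that recipes runs internally: it assumes the prepped recipe remembers only the training levels, so recoding the test values against those levels leaves foo with no valid code. The object names (trn_levels, x_test, x_recoded) are made up for this illustration.

```r
# Levels seen during prep (training data only)
trn_levels <- c("a", "b", "c")

# Test values, including the novel level "foo"
x_test <- c("a", "b", "c", "a", "b", "c", "foo")

# Recoding against the training levels turns any unseen value into NA
x_recoded <- factor(x_test, levels = trn_levels)
print(x_recoded)

# Identify which test rows are affected
na_rows <- which(is.na(x_recoded))
print(na_rows)
```

Any dummy features built from x_recoded inherit that NA, which is the pattern the summary above shows for x_b and x_c.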

We handle this problem of potential new levels in held-out data by inserting step_novel() prior to step_dummy() in our recipe. This assigns all novel levels (i.e., levels unseen in training) to a single new category, which is called new by default.

Code

rec_novel <- recipe(y ~ x, data = data_trn) |>  
  step_novel(x) |> 
  step_dummy(x)
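Conceptually, step_novel() behaves something like the base R sketch below. This is an illustration under stated assumptions rather than the recipes implementation: the helper recode_novel() is hypothetical, and the catch-all label "new" simply mirrors the default name mentioned above.

```r
# Hypothetical helper: map values not seen in training to one catch-all level
recode_novel <- function(x, trn_levels, new_level = "new") {
  x_chr <- as.character(x)
  # Leave true missing values alone; relabel only unseen, non-missing values
  x_chr[!is.na(x_chr) & !(x_chr %in% trn_levels)] <- new_level
  # Always include the catch-all level, even if no value uses it
  factor(x_chr, levels = c(trn_levels, new_level))
}

x_test <- factor(c("a", "b", "c", "foo"))
x_recoded <- recode_novel(x_test, trn_levels = c("a", "b", "c"))
print(x_recoded)
table(x_recoded)
```

Because the catch-all level is always part of the factor, it exists in the training features too, even though no training observation uses it. In recipes itself, the label can be changed with the new_level argument of step_novel().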

When we now prep this recipe using training data that does not contain foo (the novel level we will find in test), everything is fine.

Code

rec_novel_prep <- rec_novel |> 
  prep(data_trn)

When we bake features for training data, we see what step_novel() did. It added a new level and therefore a new feature to code the contrast of that level with the reference level. However, given that this new level was not present in our training data, all observations are assigned a zero for this new feature.

Code

feat_trn_novel <- rec_novel_prep |> 
  bake(NULL)

feat_trn_novel |> bind_cols(data_trn |> select(x)) |> print()

# A tibble: 6 × 5
        y   x_b   x_c x_new x    
    <dbl> <dbl> <dbl> <dbl> <fct>
1  1.16       0     0     0 a    
2 -1.01       1     0     0 b    
3  0.929      0     1     0 c    
4 -0.494      0     0     0 a    
5  0.0734     1     0     0 b    
6 -0.580      0     1     0 c

But now when we bake the test data, this new feature is set to 1 for the observation with the novel level foo.

Code

feat_test_novel <- rec_novel_prep |> 
  bake(data_test)

feat_test_novel |> bind_cols(data_test |> select(x)) |> print()

# A tibble: 7 × 5
         y   x_b   x_c x_new x    
     <dbl> <dbl> <dbl> <dbl> <fct>
1  0.836       0     0     0 a    
2  0.412       1     0     0 b    
3 -0.0199      0     1     0 c    
4 -0.968       0     0     0 a    
5  0.00296     1     0     0 b    
6 -0.864       0     1     0 c    
7 -0.306       0     0     1 foo

All looks normal when we fit this model to our training features.

Code

fit_novel <-
  linear_reg() %>% 
  set_engine("lm") %>% 
  fit(y ~ ., data = feat_trn_novel)

However, if we look at the parameter estimates, we see that the algorithm was unable to estimate a parameter for x_new because that feature was a constant (all zeros) in the training set. Of course, this makes sense: there were no observations of foo in training, so the model could not learn how that new level differed from the reference level.

Code

fit_novel %>% tidy()

# A tibble: 4 × 5
  term        estimate std.error statistic p.value
  <chr>          <dbl>     <dbl>     <dbl>   <dbl>
1 (Intercept)    0.335     0.719     0.466   0.673
2 x_b           -0.806     1.02     -0.792   0.486
3 x_c           -0.161     1.02     -0.158   0.885
4 x_new         NA        NA        NA      NA
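The NA estimate follows from the design matrix itself. As a quick check (a sketch using base R only, with the matrix X typed in by hand to match the training features above), the x_new column is all zeros, so the matrix has four columns but only rank three, and one coefficient cannot be estimated.

```r
# Design matrix for the six training rows: intercept plus three dummy features
X <- cbind(
  intercept = 1,
  x_b   = c(0, 1, 0, 0, 1, 0),
  x_c   = c(0, 0, 1, 0, 0, 1),
  x_new = 0   # no training observation has the novel level
)

n_cols <- ncol(X)      # number of parameters the model tries to estimate
x_rank <- qr(X)$rank   # number of parameters it can actually estimate

print(c(columns = n_cols, rank = x_rank))
```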

This model will now generate a warning (“prediction from rank-deficient fit”) when you use it to make predictions for values it did not see in the training set.

The consequence is that the model will predict a y value associated with the reference level (coded 0 for all other dummy features) for all foo observations. This is probably the best we can do for these new (previously unseen) values for x.

Code

predict(fit_novel, feat_test_novel) |>  
  bind_cols(feat_test_novel)

Warning in predict.lm(object = object$fit, newdata = new_data, type =
"response", : prediction from rank-deficient fit; consider predict(.,
rankdeficient="NA")

# A tibble: 7 × 5
   .pred        y   x_b   x_c x_new
   <dbl>    <dbl> <dbl> <dbl> <dbl>
1  0.335  0.836       0     0     0
2 -0.471  0.412       1     0     0
3  0.174 -0.0199      0     1     0
4  0.335 -0.968       0     0     0
5 -0.471  0.00296     1     0     0
6  0.174 -0.864       0     1     0
7  0.335 -0.306       0     0     1
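To confirm that the prediction for a novel level is just the reference-level prediction, the sketch below fits the same kind of model with base R lm() on made-up data and compares the prediction for a foo-like row with the intercept. The y values are simulated with an arbitrary seed, so the numbers will not match the output above.

```r
set.seed(123)  # arbitrary seed, for reproducibility of this illustration only

# Training features shaped like feat_trn_novel: x_new is constant at 0
feat_trn <- data.frame(
  y     = rnorm(6),
  x_b   = c(0, 1, 0, 0, 1, 0),
  x_c   = c(0, 0, 1, 0, 0, 1),
  x_new = 0
)

fit <- lm(y ~ ., data = feat_trn)
coefs <- coef(fit)   # the coefficient for x_new is NA
print(coefs)

# A held-out row that belongs to the novel level
new_obs <- data.frame(x_b = 0, x_c = 0, x_new = 1)

# The rank-deficiency warning is expected here, so it is silenced
pred_new <- suppressWarnings(predict(fit, newdata = new_obs))

# The NA coefficient contributes nothing, leaving only the intercept
same_as_reference <- isTRUE(all.equal(unname(pred_new), coefs[["(Intercept)"]]))
print(same_as_reference)
```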

You do not always need to use step_novel(). Just add it to a recipe if you find that there are novel levels in your held-out data (and re-prep the recipe after you add that step, of course).
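One way to find out whether held-out data contain novel levels is to compare the observed values of each factor across the two sets. The helper below, find_novel_levels(), is a hypothetical sketch written for this appendix (it is not part of tidymodels), and the small data frames are made up to mirror the example above.

```r
# Hypothetical helper: for each factor column in the training data, list the
# values that occur in the held-out data but never occur in training
find_novel_levels <- function(trn, held_out) {
  fct_cols <- names(trn)[vapply(trn, is.factor, logical(1))]
  novel <- lapply(fct_cols, function(col) {
    seen <- unique(as.character(trn[[col]]))
    # Treat NA as "seen" so that missing values are not reported as novel
    setdiff(unique(as.character(held_out[[col]])), c(seen, NA))
  })
  names(novel) <- fct_cols
  novel[lengths(novel) > 0]   # keep only columns that have novel levels
}

trn <- data.frame(
  y = 1:6,
  x = factor(rep(c("a", "b", "c"), 2))
)
held_out <- data.frame(
  y = 1:7,
  x = factor(c(rep(c("a", "b", "c"), 2), "foo"))
)

novel <- find_novel_levels(trn, held_out)
print(novel)
```

An empty list means no novel levels were found. Any column named in the result is a candidate for step_novel().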