17  Novel Levels in Held-Out Set(s)

When nominal or ordinal predictors have infrequent levels, you will occasionally find that a level appears in your held-out set (i.e., validation or test) but not in your training set. This causes problems when you try to make predictions for observations with the new level. Specifically, the feature values for these observations will be set to NA and, therefore, you will get predictions of NA for them.

In this appendix, we demonstrate this problem and our preferred solution given our workflow of classing all nominal/ordinal predictors as factors in our dataframes.

library(tidyverse) 
library(tidymodels) 
devtools::source_url("https://github.com/jjcurtin/lab_support/blob/main/fun_eda.R?raw=true")

Make simple data sets with an outcome (y) and one nominal predictor (x). Note that x will have a novel value (foo) in the test set that wasn't present in the training set.

n <- 6
data_trn <- tibble(y = rnorm(n), 
                   x = rep(c("a", "b", "c"), n / 3)) |>
  mutate(x = factor(x, levels = c("a", "b", "c"))) |> 
  print()
# A tibble: 6 × 2
       y x    
   <dbl> <fct>
1  1.03  a    
2 -0.478 b    
3  0.772 c    
4  0.948 a    
5  0.214 b    
6  0.333 c    
data_test <- tibble(y = c(rnorm(n), rnorm(1)),
                    x = c(rep(c("a", "b", "c"), n / 3), "foo")) |> 
  mutate(x = factor(x, levels = c("a", "b", "c", "foo"))) |> 
  print()
# A tibble: 7 × 2
        y x    
    <dbl> <fct>
1 -0.0664 a    
2 -0.520  b    
3 -0.994  c    
4 -0.980  a    
5 -0.191  b    
6  0.125  c    
7  0.783  foo  
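
Before building features, it can help to check directly for values that occur in the held-out data but are not levels in the training data. The following is a minimal base R sketch of such a check. The vectors x_trn and x_test are illustrative stand-ins for the x columns above and are not part of the original workflow.

```r
# Illustrative stand-ins for the x columns of data_trn and data_test
x_trn <- factor(c("a", "b", "c", "a", "b", "c"))
x_test <- factor(c("a", "b", "c", "a", "b", "c", "foo"))

# Values observed in the held-out data that are not levels in training
novel_levels <- setdiff(unique(as.character(x_test)), levels(x_trn))
novel_levels
```

Here novel_levels contains only "foo". An empty result would mean that every held-out value is already a known training level.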

Make a recipe

rec <- recipe(y ~ x, data = data_trn) |> 
  step_dummy(x)

Prep the recipe with training data

rec_prep <- rec |> 
  prep(data_trn)

Make features for the training set. No problems.

feat_trn <- rec_prep |> 
  bake(data_trn)

feat_trn |> skim_all()
Data summary
Name feat_trn
Number of rows 6
Number of columns 3
_______________________
Column type frequency:
numeric 3
________________________
Group variables None

Variable type: numeric

skim_variable  n_missing  complete_rate   mean    sd     p0    p25    p50   p75  p100   skew  kurtosis
y                      0              1   0.47  0.57  -0.48   0.24   0.55  0.90  1.03  -0.51     -1.45
x_b                    0              1   0.33  0.52   0.00   0.00   0.00  0.75  1.00   0.54     -1.96
x_c                    0              1   0.33  0.52   0.00   0.00   0.00  0.75  1.00   0.54     -1.96

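Make features for the test set. Baking warns about the new level, and x_b and x_c are missing (n_missing = 1) for the observation with foo.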

feat_test <- rec_prep |> 
  bake(data_test)
Warning: ! There are new levels in a factor: `foo`.
feat_test |> skim_all()
Data summary
Name feat_test
Number of rows 7
Number of columns 3
_______________________
Column type frequency:
numeric 3
________________________
Group variables None

Variable type: numeric

skim_variable  n_missing  complete_rate   mean    sd     p0    p25    p50   p75  p100   skew  kurtosis
y                      0           1.00  -0.26  0.63  -0.99  -0.75  -0.19  0.03  0.78   0.25     -1.42
x_b                    1           0.86   0.33  0.52   0.00   0.00   0.00  0.75  1.00   0.54     -1.96
x_c                    1           0.86   0.33  0.52   0.00   0.00   0.00  0.75  1.00   0.54     -1.96
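
To see how these missing features propagate into predictions, the sketch below rebuilds the example end to end and predicts into the test set. It assumes the recipes package is installed, uses base lm() rather than a parsnip model for brevity, and sets an arbitrary seed (our assumption), so the numbers will not match the output above.

```r
library(recipes)

set.seed(123)  # arbitrary seed, chosen only for reproducibility
data_trn <- data.frame(y = rnorm(6),
                       x = factor(rep(c("a", "b", "c"), 2)))
data_test <- data.frame(y = rnorm(7),
                        x = factor(c(rep(c("a", "b", "c"), 2), "foo")))

rec_prep <- recipe(y ~ x, data = data_trn) |>
  step_dummy(x) |>
  prep(training = data_trn)

feat_trn <- bake(rec_prep, new_data = data_trn)
# bake() warns about the new level (suppressed here);
# x_b and x_c are NA for the foo row
feat_test <- suppressWarnings(bake(rec_prep, new_data = data_test))

fit <- lm(y ~ ., data = feat_trn)
preds <- predict(fit, newdata = feat_test)
is.na(preds)  # only the last (foo) observation has a missing prediction
```

The missing prediction arises because predict() passes NA feature values through rather than dropping the row.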

We solve this problem by making sure that this level is listed when we create the factor in the training set. For example, use this mutate() earlier when classing x in data_trn: mutate(x = factor(x, levels = c("a", "b", "c", "foo"))).

Or we can add the level after the fact, when we discover the problem (as below).

data_trn1 <- data_trn |> 
  mutate(x = factor(x, levels = c("a", "b", "c", "foo")))
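
If you would rather not retype the full set of levels, the new level can also be appended to the existing ones. This is a base R sketch with an illustrative factor x_trn (not part of the original workflow). The forcats function fct_expand() offers similar behavior.

```r
x_trn <- factor(c("a", "b", "c", "a", "b", "c"))

# Append "foo" to the existing levels without changing any observed values
x_trn1 <- factor(x_trn, levels = c(levels(x_trn), "foo"))

levels(x_trn1)
table(x_trn1)  # foo is now a valid level with a count of 0
```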

Now prep the recipe with this updated training set, which includes the foo level.

rec_prep1 <- rec |> 
  prep(data_trn1)

Make features for the training set as before. There is now an x_foo feature, which is 0 for every training observation.

feat_trn1 <- rec_prep1 |> 
  bake(data_trn1)

feat_trn1 |> skim_all()
Data summary
Name feat_trn1
Number of rows 6
Number of columns 4
_______________________
Column type frequency:
numeric 4
________________________
Group variables None

Variable type: numeric

skim_variable  n_missing  complete_rate   mean    sd     p0    p25    p50   p75  p100   skew  kurtosis
y                      0              1   0.47  0.57  -0.48   0.24   0.55  0.90  1.03  -0.51     -1.45
x_b                    0              1   0.33  0.52   0.00   0.00   0.00  0.75  1.00   0.54     -1.96
x_c                    0              1   0.33  0.52   0.00   0.00   0.00  0.75  1.00   0.54     -1.96
x_foo                  0              1   0.00  0.00   0.00   0.00   0.00  0.00  0.00    NaN       NaN

Now there is no problem when we find this value for an observation in the test set.

feat_test1 <- rec_prep1 |> 
  bake(data_test)

feat_test1 |> skim_all()
Data summary
Name feat_test1
Number of rows 7
Number of columns 4
_______________________
Column type frequency:
numeric 4
________________________
Group variables None

Variable type: numeric

skim_variable  n_missing  complete_rate   mean    sd     p0    p25    p50   p75  p100   skew  kurtosis
y                      0              1  -0.26  0.63  -0.99  -0.75  -0.19  0.03  0.78   0.25     -1.42
x_b                    0              1   0.29  0.49   0.00   0.00   0.00  0.50  1.00   0.75     -1.60
x_c                    0              1   0.29  0.49   0.00   0.00   0.00  0.50  1.00   0.75     -1.60
x_foo                  0              1   0.14  0.38   0.00   0.00   0.00  0.00  1.00   1.62      0.80

All is good. However, there are still some complexities when we fit a model in the training set and predict into the test set. In training, the x_foo feature is a constant (all 0), which causes problems for some statistical algorithms. Let's see what happens when we fit a linear model and use it to predict into test.

fit1 <-
  linear_reg() |> 
  set_engine("lm") |> 
  fit(y ~ ., data = feat_trn1)

If we look at the parameter estimates, we see that the algorithm was unable to estimate a parameter for x_foo because it was a constant in the training set.

fit1 |> tidy()
# A tibble: 4 × 5
  term        estimate std.error statistic p.value
  <chr>          <dbl>     <dbl>     <dbl>   <dbl>
1 (Intercept)    0.989     0.238      4.16  0.0253
2 x_b           -1.12      0.336     -3.33  0.0446
3 x_c           -0.436     0.336     -1.30  0.285 
4 x_foo         NA        NA         NA    NA     

This will generate a warning (“prediction from rank-deficient fit”) when you use this model to make predictions for values it didn't see in the training set.

predict(fit1, feat_test1) |>  
  bind_cols(feat_test1)
Warning in predict.lm(object = object$fit, newdata = new_data, type =
"response", : prediction from rank-deficient fit; consider predict(.,
rankdeficient="NA")
# A tibble: 7 × 5
   .pred       y   x_b   x_c x_foo
   <dbl>   <dbl> <dbl> <dbl> <dbl>
1  0.989 -0.0664     0     0     0
2 -0.132 -0.520      1     0     0
3  0.553 -0.994      0     1     0
4  0.989 -0.980      0     0     0
5 -0.132 -0.191      1     0     0
6  0.553  0.125      0     1     0
7  0.989  0.783      0     0     1
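
The pattern in these predictions can be reproduced with a small base R sketch. It uses a hand-built feature set with hypothetical values (not the data above). Because x_foo is constant in training, lm() returns NA for its coefficient and leaves it out when computing predictions.

```r
# Hypothetical training features: x_foo is all 0, as in feat_trn1
feat_trn <- data.frame(y     = c(1.0, -0.5, 0.8, 0.9, 0.2, 0.3),
                       x_b   = c(0, 1, 0, 0, 1, 0),
                       x_c   = c(0, 0, 1, 0, 0, 1),
                       x_foo = 0)
fit <- lm(y ~ ., data = feat_trn)
coef(fit)  # the coefficient for x_foo is NA

# One reference-level (a) observation and one foo observation
new_obs <- data.frame(x_b = c(0, 0), x_c = c(0, 0), x_foo = c(0, 1))
# predict() warns about the rank-deficient fit; suppressed here for brevity
preds <- suppressWarnings(predict(fit, newdata = new_obs))
preds  # both predictions equal the intercept
```

In this sketch the foo observation is treated exactly like an observation at the reference level.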

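Note that the foo observation (row 7) gets the same prediction as the reference level a (0.989) because no parameter could be estimated for x_foo. This is our preferred solution when new (previously unseen) levels exist in held-out data. A comparable solution is offered as a recipe step; see step_novel().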
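
As a rough sketch of the step_novel() alternative, the step can be placed before step_dummy() so that unseen levels are mapped to a placeholder level (named "new" by default) rather than producing NA. This assumes the recipes package is installed and uses an arbitrary seed and simulated data of our own, so it is an illustration and not the workflow shown above.

```r
library(recipes)

set.seed(123)  # arbitrary seed, chosen only for reproducibility
data_trn <- data.frame(y = rnorm(6),
                       x = factor(rep(c("a", "b", "c"), 2)))
data_test <- data.frame(y = rnorm(7),
                        x = factor(c(rep(c("a", "b", "c"), 2), "foo")))

rec_prep <- recipe(y ~ x, data = data_trn) |>
  step_novel(x) |>   # adds a "new" level to hold unseen values
  step_dummy(x) |>
  prep(training = data_trn)

feat_test <- bake(rec_prep, new_data = data_test)
feat_test$x_new   # 1 for the foo observation, 0 otherwise
anyNA(feat_test)  # no missing features
```

The resulting x_new feature is still constant in the training set, so the rank-deficiency issue described for x_foo applies to it as well.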