Use of root mean square error (RMSE) in training and validation sets for model performance evaluation
The General Linear Model as a machine learning model
Extensions to categorical variables (Dummy coding features)
Extensions to interactive and non-linear effects of features
K Nearest Neighbor (KNN)
Hyperparameter \(k\)
Scaling predictors
Extensions to categorical variables
Our goal in this unit is to build a machine learning regression model that can accurately (we hope) predict the sale_price for future sales of houses (in Iowa? more generally?)
To begin this project we need to:
Load the packages we will need. I am only loading tidymodels and tidyverse because the other functions we need are only called occasionally (so we will call them by namespace.)
Set up conflicts policies (just in case we decide to load other packages later)
We will hide this in future units
Code
library(tidyverse) # for general data wrangling
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.2 ✔ tibble 3.2.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.0.4
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
We will fit regression models with various model configurations.
These configurations will differ with respect to statistical algorithm:
A General Linear Model (lm) - a parametric approach
K Nearest Neighbor (KNN) - a non-parametric approach
These configurations will differ with respect to the features
Single feature (i.e., simple regression)
Various sets of multiple features that vary by:
Raw predictors used
Transformations applied to those predictors as part of feature engineering
Inclusion (or not) of interactions among features
The KNN model configurations will also differ with respect to the value of their hyperparameter, \(k\)
To build models that will work well in new data (e.g., the data that I have held back from you so far):
We have split the remaining data into a training and validation set for our own use during model building
We will fit models in train
We will evaluate them in validation
Remember that we:
Used a 75/25 stratified (on sale_price) split of the data at the end of cleaning EDA to create training and validation sets (see the sketch after this list)
Are only using a subset of the available predictors. The same ones I used for the EDA unit
You will work with all of my predictors and all the predictors you used for your EDA when you do the application assignment for this unit
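For reference, here is a minimal sketch of how such a split might be made with rsample. The object name data_all and the seed are placeholders, and the course's actual splitting code may differ.
Code
set.seed(123)  # illustrative seed
splits <- initial_split(data_all, prop = 0.75, strata = sale_price)  # data_all is a placeholder name
data_trn <- training(splits)
data_val <- testing(splits)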
3.2
Pause for a moment to answer this question:
Question
Why do we need independent validation data to select the best model configuration? In other words, why can't we just fit and evaluate all of the models in our one training set?
These models will all overfit the dataset within which they are fit to some degree.
In other words, they will predict both systematic variance (the DGP) and some noise in the training set. However, they will differ in how much they overfit the training set.
As the models get more flexible, they will have the potential to overfit to a greater degree. Models with a larger number of features (e.g., more predictors, features based on interactions as well as raw predictors) will overfit to a greater degree. All other things equal, the non-parametric KNN will also be more flexible than the general linear model, so it may overfit to a greater degree as well if the true DGP is linear on the features.
Therefore, just because a model fits the training set well does not mean it will work well in new data because the noise will be different in every new dataset. This overfitting will be removed from our performance estimate if we calculate it with new data (the validation set).
Let’s take a quick look at the available raw predictors in the training set
Code
data_trn |> skim_all()
Data summary

Name: data_trn
Number of rows: 1463
Number of columns: 10
Column type frequency: factor (5), numeric (5)
Group variables: None

Variable type: factor

skim_variable  n_missing  complete_rate  n_unique  top_counts
overall_qual   0          1              10        5: 411, 6: 354, 7: 305, 8: 176
garage_qual    0          1              6         ta: 1299, no_: 86, fa: 62, gd: 12
ms_zoning      0          1              7         res: 1141, res: 225, flo: 72, com: 13
lot_config     0          1              5         ins: 1096, cor: 239, cul: 83, fr2: 42
bldg_type      0          1              5         one: 1211, tow: 110, dup: 65, tow: 51

Variable type: numeric

skim_variable  n_missing  complete_rate  mean       sd        p0     p25       p50     p75     p100    skew   kurtosis
sale_price     0          1              180307.89  78086.25  12789  129450.0  160000  213500  625000   1.48    3.17
gr_liv_area    0          1                1503.46    490.65    438    1139.5    1452    1762    3672   0.93    1.29
lot_area       0          1                9942.60   6898.10   1476    7405.5    9360   11391  164660  11.26  213.25
year_built     0          1                1971.60     29.70   1880    1954.0    1973    2001    2010  -0.56   -0.59
garage_cars    1          1                   1.77      0.77      0       1.0       2       2       4  -0.29    0.09
Remember from our modeling EDA that we have some issues to address as part of our feature engineering:
Missing values
Possible transformation of sale_price
Possible transformation of other numeric predictors
We will need to use some feature engineering techniques to handle categorical variables
We may need to consider interactions among features
All of this will be accomplished with a recipe
But first, let’s discuss/review our first statistical algorithm
3.3 The Simple (General) Linear Model (LM)
We will start with only a quick review of the use of the simple (one feature) linear model (LM) as a machine learning model because you should be very familiar with this statistical model at this point
\(Y = \beta_0 + \beta_1*X_1 + \epsilon\)
Applied to our regression problem, we might fit a model such as:
\(sale\_price = \beta_0 + \beta_1*gr\_liv\_area + \epsilon\)
The (general) linear model is a parametric model. We need to estimate two parameters using our training set
\(\beta_0\)
\(\beta_1\)
You already know how to do this using lm() in base R. However, we will use the tidymodels modeling approach.
We use tidymodels because:
It provides a consistent interface to many (and growing numbers) of statistical algorithms
It provides very strong and easy feature engineering routines (e.g., missing data, scaling, transformations, near-zero variance, collinearity) via recipes
It simplifies model performance evaluation using resampling approaches (that you don’t know yet!)
It supports numerous performance metrics
It is tightly integrated within the tidyverse
It is under active development and support
You can see documentation for all of the packages at the tidymodels website. It is worth a quick review now to get a sense of what is available
To fit a model with a specific configuration, we need to:
Set up a feature engineering recipe
Use the recipe to make a feature matrix
prep() it with training data
bake() it with the data for which you want to calculate the feature matrix
Select and define the statistical algorithm
Fit the algorithm to the feature matrix from our training set
These steps are accomplished with functions from the recipes and parsnip packages.
We will start with a simple model configuration
General linear model (lm)
One feature (raw gr_liv_area)
Fit in training data
Set up a VERY SIMPLE feature engineering recipe
Include the outcome on the left side of ~
Include raw predictors (not yet features) on the right side of ~.
Indicate the training data
Code
rec <- recipe(sale_price ~ gr_liv_area, data = data_trn)
We can see a summary of it to verify it is doing what you expect by calling
rec
summary(rec)
We can then prep the recipe and bake the data to make our feature matrix from the training dataset
Again, remember we always prep a recipe with training data but use the prepped recipe to bake any data
In this instance we will prep with data_trn and then bake data_trn so that we have features from our training set to train/fit the model.
Code
rec_prep <- rec |> prep(training = data_trn)
And now we can bake the training data to make a feature matrix
Training features were already saved in the prepped recipe so we set new_data = NULL
Code
feat_trn <- rec_prep |> bake(new_data = NULL)
You should always review the feature matrix to make sure it looks as you expect
includes outcome (sale_price)
includes expected feature (gr_liv_area)
Sample size is as expected
No missing data
Code
feat_trn |> skim_all()
Data summary

Name: feat_trn
Number of rows: 1463
Number of columns: 2
Column type frequency: numeric (2)
Group variables: None

Variable type: numeric

skim_variable  n_missing  complete_rate  mean       sd        p0     p25       p50     p75     p100    skew  kurtosis
gr_liv_area    0          1                1503.46    490.65    438    1139.5    1452    1762    3672   0.93  1.29
sale_price     0          1              180307.89  78086.25  12789  129450.0  160000  213500  625000   1.48  3.17
Now let’s consider the statistical algorithm
tidymodels breaks this apart into two pieces for clarity
First, you specify the broad category of algorithm (e.g., linear_reg())
Second, you specify the computational engine to use with set_engine() (e.g., "lm")
You can also better understand how the engine will be called using translate()
Not useful here but will be with more complicated algorithms
Code
linear_reg() |> set_engine("lm") |> translate()
Linear Regression Model Specification (regression)
Computational engine: lm
Model fit template:
stats::lm(formula = missing_arg(), data = missing_arg(), weights = missing_arg())
Let’s combine our feature matrix with an algorithm to fit a model in our training set using only raw gr_liv_area as a feature
Note the specification of
The category of algorithm
The engine (no need to set the mode b/c linear_reg() only supports the regression mode)
The use of the . to indicate all features in the matrix.
not that useful here because there is only one feature: gr_liv_area
will be useful when we have many features in the matrix
We use the feature matrix (rather than the raw data) from the training set to fit the model.
Code
fit_lm_1 <-
  linear_reg() |>                          # <1>
  set_engine("lm") |>                      # <2>
  fit(sale_price ~ ., data = feat_trn)     # <3>

1. category of algorithm
2. engine
3. use of . for all features and use of feature matrix from training set
We can get the parameter estimates, standard errors, and statistical tests for each \(\beta\) = 0 for this model using tidy() from the broom package (loaded as part of the tidyverse)
There are a variety of ways to pull out the estimates for each feature (and intercept)
Option 1: Pull all estimates from the tidy object
Code
fit_lm_1 |> tidy() |> pull(estimate)
[1] 7695.7464 114.8096
Option 2: Extract a single estimate using $ and the row number. Be careful: this only works if the order of the coefficients doesn't change! It assumes the coefficient for the relevant feature is always the second one.
Code
tidy(fit_lm_1)$estimate[[2]]
[1] 114.8096
Option 3: Filter tidy df to the relevant row (using term ==) and pull the estimate. Safer!
Option 4: Write a function if we plan to do this a lot. We include this function in the fun_ml.R script in our repo. Better still (safe and code efficient)!
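For example, here is a sketch of what options 3 and 4 might look like. The function name get_estimate() is ours for illustration; the version in fun_ml.R may differ.
Code
# Option 3: filter the tidy df to the relevant term and pull its estimate
fit_lm_1 |>
  tidy() |>
  filter(term == "gr_liv_area") |>
  pull(estimate)

# Option 4: wrap that logic in a reusable function (illustrative name)
get_estimate <- function(fit, term_name) {
  fit |>
    tidy() |>
    filter(term == term_name) |>
    pull(estimate)
}

get_estimate(fit_lm_1, "gr_liv_area")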
We can visualize how well this model performs in the validation set by plotting predicted sale_price (\(\hat{sale\_price}\)) vs. sale_price (ground truth in machine learning terminology) for these data
We might do this a lot so let’s write a function. We have also included this function in fun_ml.R
Code
plot_truth <- function(truth, estimate) {
  ggplot(mapping = aes(x = truth, y = estimate)) +
    geom_abline(lty = 2, color = "red") +
    geom_point(alpha = 0.5) +
    labs(y = "predicted outcome", x = "outcome") +
    coord_obs_pred()  # scale axes uniformly (from tune package)
}
Perfect performance would have all the points right on the dotted line (same value for actual and predicted outcome)
Our model doesn’t do that well yet. Not surprising
The pattern also shows some indication of fanning of the residuals AND some non-linearity at higher outcome scores, which suggests the need for a power transformation of the outcome (e.g., log)
This is consistent with our earlier modeling EDA
Perhaps not that bad here b/c both sale_price and gr_liv_area were positively skewed
We will need to consider this eventually
We can quantify model performance by selecting a performance metric
The yardstick package within the tidymodels framework supports calculation of many performance metrics for regression and classification models
Root mean square error (RMSE) is a common performance metric for regression models
You focused on a related metric, sum of squared error (SSE), in PSY 610/710
RMSE simply divides SSE by N (to get mean squared error; MSE) and then takes the square root to return the metric to the original units for the outcome variable
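In equation form, where \(y_i\) is the observed outcome for observation \(i\), \(\hat{y}_i\) is its prediction, and \(N\) is the number of observations:
\(RMSE = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2}\)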
It is easy to calculate using rmse_vec() from the yardstick package
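Here is a sketch of how the value reported below might be computed and tracked, assuming feat_val holds the baked validation features (the error_val object matches the tracking dataframe used later in this unit):
Code
rmse_vec(truth = feat_val$sale_price,
         estimate = predict(fit_lm_1, feat_val)$.pred)

error_val <- tibble(model = "simple linear model",
                    rmse_val = rmse_vec(feat_val$sale_price,
                                        predict(fit_lm_1, feat_val)$.pred))
error_val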
# A tibble: 1 × 2
model rmse_val
<chr> <dbl>
1 simple linear model 48671.
NOTE: I will continue to bind RMSE to this dataframe for newer models but plan to hide this code chunk to avoid distractions. You can reuse this code repeatedly to track your own models if you like. (Perhaps we should write a function??)
For explanatory purposes, we might want to visualize the relationship between a feature and the outcome (in addition to examining the parameter estimates and the associated statistical tests)
Here is a plot of \(\hat{sale\_price}\) by gr_liv_area superimposed over a scatterplot of the raw data from the validation set
Code
feat_val |>
  ggplot(aes(x = gr_liv_area)) +
  geom_point(aes(y = sale_price), color = "gray") +
  geom_line(aes(y = predict(fit_lm_1, data_val)$.pred), linewidth = 1.25, color = "blue") +
  ggtitle("Validation Set")
As expected, there is a moderately strong positive relationship between gr_liv_area and sale_price.
We can also again see the heteroscedasticity in the errors that might be corrected by a power transformation of sale_price (or gr_liv_area)
3.4 Extension of LM to Multiple Predictors
We can improve model performance by moving from a simple linear model to a linear model with multiple features derived from multiple predictors
We have many other numeric variables available to use, even in this pared down version of the dataset.
I am going to stop using new_data = but remember, we use NULL if we want the previously saved training features. If instead, we want features for new data, we provide that dataset.
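A sketch of what this recipe might look like, using the four numeric predictors shown in the skim below; the median-imputation step for garage_cars is an assumption about how its single missing value was handled:
Code
rec <- recipe(sale_price ~ gr_liv_area + lot_area + year_built + garage_cars,
              data = data_trn) |>
  step_impute_median(garage_cars)  # assumed; handles the one missing value in train

rec_prep <- rec |> prep(training = data_trn)
feat_trn <- rec_prep |> bake(NULL)  # NULL returns the saved training features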
And take a quick look at the features
Sample size is correct
4 features and the outcome variable
All features are numeric
No missing data (note that garage_cars no longer has a missing value)
Code
feat_trn |> skim_all()
Data summary

Name: feat_trn
Number of rows: 1463
Number of columns: 5
Column type frequency: numeric (5)
Group variables: None

Variable type: numeric

skim_variable  n_missing  complete_rate  mean       sd        p0     p25       p50     p75     p100    skew   kurtosis
gr_liv_area    0          1                1503.46    490.65    438    1139.5    1452    1762    3672    0.93    1.29
lot_area       0          1                9942.60   6898.10   1476    7405.5    9360   11391  164660   11.26  213.25
year_built     0          1                1971.60     29.70   1880    1954.0    1973    2001    2010   -0.56   -0.59
garage_cars    0          1                   1.77      0.77      0       1.0       2       2       4   -0.30    0.09
sale_price     0          1              180307.89  78086.25  12789  129450.0  160000  213500  625000    1.48    3.17
And finally, bake the validation data with the same prepped recipe to get validation features
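For example (a one-line sketch using the prepped recipe from above):
Code
feat_val <- rec_prep |> bake(new_data = data_val)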
Of course, these four features are correlated not only with sale_price but also with each other
Let’s look at correlations in the training set.
Code
feat_trn |> cor() |> corrplot::corrplot.mixed()
Question
What are the implications of the correlations among many of these predictors?
The multiple regression model coefficients represent unique effects, controlling for all other variables in the model. You can see how the unique effect of gr_liv_area is smaller than its overall effect from the simple regression. This also means that the overall predictive strength of the model will not be a sum of the effects of each predictor considered in isolation - it will likely be less.
Also, if the correlations are high, problems with multicollinearity will emerge. This will yield large standard errors which means that the models will start to have more variance when fit in different training datasets! We will soon learn about other regularized versions of the GLM that do not have these issues with correlated predictors.
How well does this more complex model perform in validation? Let’s compare the previous and current visualizations of \(sale\_price\) vs. \(\hat{sale\_price}\)
Looks like the errors are smaller (closer to the diagonal line that would represent perfect prediction)
Clear signs of non-linearity are now present as well. Time for more Modeling EDA!!
Notice the use of plot_grid() from the cowplot package to make side by side plots. This also required returning the individual plots as objects (just assign each to an object name, e.g., plot_1)
Let’s compare model performance for the two models using RMSE in the validation set
Let’s bind the new performance metric to our results table
# A tibble: 2 × 2
model rmse_val
<chr> <dbl>
1 simple linear model 48671.
2 4 feature linear model 37734.
Given the non-linearity suggested by the truth vs. estimate plots, we might wonder if we could improve the fit if we transformed our features to be closer to normal
We will apply step_YeoJohnson(), which is similar to a Box-Cox transformation but can be more broadly applied because the scores don’t need to be strictly positive
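A sketch of adding this step to the recipe, applied here to all numeric predictors (the course recipe may target specific columns, and the imputation step is the same assumption as above):
Code
rec_yj <- recipe(sale_price ~ gr_liv_area + lot_area + year_built + garage_cars,
                 data = data_trn) |>
  step_impute_median(garage_cars) |>          # assumed, as above
  step_YeoJohnson(all_numeric_predictors())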
Let’s do it all again, now with transformed features!
# A tibble: 3 × 2
model rmse_val
<chr> <dbl>
1 simple linear model 48671.
2 4 feature linear model 37734.
3 4 feature linear model with YJ 39270.
That didn’t help at all. Error still high and still non-linearity in plot.
We may need to consider
a transformation of sale_price (We will leave that to you for the application assignment if you are daring!)
or a different algorithm that can handle non-linear relationships better
3.5 Extension to Categorical Predictors
Many important predictors in our models may be categorical (nominal and some ordinal predictors)
Some statistical algorithms (e.g., random forest) can accept even nominal predictors as features without any further feature engineering
But many cannot. Linear models cannot.
The type of feature engineering may differ for nominal vs. ordinal categorical predictors
For nominal categorical predictors:
We need to learn a common approach to transform them to numeric features - dummy coding. We will learn the concept in general AND how to accomplish within a feature engineering recipe.
For ordinal predictors:
We can treat them like numeric predictors
We can treat them like nominal categorical predictors
For many algorithms, we will need to use feature engineering to convert a categorical predictor to numeric features. One common technique is to use dummy coding. When dummy coding a predictor, we transform the original categorical predictor with m levels into m-1 dummy coded features.
To better understand how and why we do this, let's consider a version of ms_zoning in the Ames dataset.
1. Make a df (dataframe) with only sale_price and ms_zoning
2. fct_collapse() from the forcats package is our preferred way to collapse levels of a factor. See fct_recode() for more generic recoding of levels.
3. We could have left this line out and float would have stayed as a level named float
4. Remove the original ms_zoning predictor
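A sketch of the code these annotations describe; the original level names passed to fct_collapse() are assumptions based on the abbreviated counts shown earlier:
Code
data_dummy <- data_trn |>
  select(sale_price, ms_zoning) |>                            # 1
  mutate(ms_zoning3 = fct_collapse(ms_zoning,                 # 2
           residential = c("res_low", "res_med", "res_high"), # assumed level names
           commercial  = c("agri", "commer", "indus"),        # assumed level names
           floating    = "float")) |>                         # 3
  select(-ms_zoning)                                          # 4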
Take a look at the new predictor
Code
data_dummy |> pull(ms_zoning3) |> table()
commercial floating residential
16 72 1375
Question
Why can't we simply recode each level with a different consecutive value (e.g., commercial = 1, floating = 2, residential = 3)?
There is no meaningful way to order the numbers that we assign to the levels of this unordered categorical predictor. The shape and strength of the relationship between it and sale_price will completely change based on arbitrary ordering of the levels.
Imagine fitting a straight line to predict sale_price from ms_zoning3 using these three different ways to arbitrarily assign numbers to levels.
Code
data_dummy |>
  mutate(ms_zoning3 = case_when(ms_zoning3 == "residential" ~ 1,
                                ms_zoning3 == "commercial" ~ 2,
                                ms_zoning3 == "floating" ~ 3)) |>
  ggplot(aes(x = ms_zoning3, y = sale_price)) +
  geom_bar(stat = "summary", fun = "mean")
Code
data_dummy |>
  mutate(ms_zoning3 = case_when(ms_zoning3 == "residential" ~ 2,
                                ms_zoning3 == "commercial" ~ 1,
                                ms_zoning3 == "floating" ~ 3)) |>
  ggplot(aes(x = ms_zoning3, y = sale_price)) +
  geom_bar(stat = "summary", fun = "mean")
Code
data_dummy |>
  mutate(ms_zoning3 = case_when(ms_zoning3 == "residential" ~ 3,
                                ms_zoning3 == "commercial" ~ 1,
                                ms_zoning3 == "floating" ~ 2)) |>
  ggplot(aes(x = ms_zoning3, y = sale_price)) +
  geom_bar(stat = "summary", fun = "mean")
Dummy coding resolves this issue.
When using dummy codes, we transform (i.e., feature engineer) our original m-level categorical predictor to m-1 dummy features.
Each of these m-1 features represents a contrast between a specific level of the categorical variable and a reference level
The full set of m-1 features represents the overall effect of the categorical predictor variable.
We assign values of 0 or 1 to each observation on each feature in a meaningful pattern (see below)
For example, with our three-level predictor: ms_zoning3
We need 2 dummy features (d1, d2) to represent this 3-level categorical predictor
Dummy feature 1 is coded 1 for residential and 0 for all other levels
Dummy feature 2 is coded 1 for floating and 0 for all other levels
Commercial properties are coded 0 for both d1 and d2.
This means that commercial properties will become the reference level against which both residential and floating village are compared.
Because we are focused on prediction, the choice of reference level is mostly arbitrary. For explanatory goals, you might consider which level is best suited to be the reference.
There is much deeper coverage of dummy and other contrast coding in 610/710
We can add these two features manually to the data frame and view a handful of observations to make this coding scheme more concrete
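A minimal sketch of adding these features by hand (the column names d1 and d2 follow the text):
Code
data_dummy <- data_dummy |>
  mutate(d1 = if_else(ms_zoning3 == "residential", 1, 0),
         d2 = if_else(ms_zoning3 == "floating", 1, 0))

# view the coding pattern
data_dummy |>
  select(ms_zoning3, d1, d2) |>
  distinct()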
If we now fit a model where we predict sale_price from these two dummy coded features, each feature would represent the contrast of the mean sale_price for the level coded 1 vs. the mean sale_price for the level that is coded 0 for all features (i.e., commercial)
d1 is the contrast of mean sale_price for residential vs. commercial
d2 is the contrast of mean sale_price for floating vs. commercial
The combined effect of these two features represents the overall effect of ms_zoning3 on sale_price
Let's do this quickly in base R using lm() as you have done previously in 610.
Code
m <- lm(sale_price ~ d1 + d2, data = data_dummy)
m |> summary()
Call:
lm(formula = sale_price ~ d1 + d2, data = data_dummy)
Residuals:
Min 1Q Median 3Q Max
-166391 -49680 -19680 31194 445820
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 78687 19195 4.099 4.37e-05 ***
d1 100493 19307 5.205 2.22e-07 ***
d2 145745 21221 6.868 9.62e-12 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 76780 on 1460 degrees of freedom
Multiple R-squared: 0.03446, Adjusted R-squared: 0.03313
F-statistic: 26.05 on 2 and 1460 DF, p-value: 7.651e-12
The mean sale price of residential properties is $100,493 higher than commercial properties.
The mean sale price of floating villages is $145,745 higher than commercial properties.
To understand this conceptually, it is easiest to visualize the linear model that would predict sale_price with these two dichotomous features.
There are only three columns of sale_price because the only possible values for d1 and d2 (which are both dichotomous) are
0,0 (commercial)
1,0 (residential)
0,1 (floating village)
This regression with two features yields a prediction plane (displayed)
The left/right tilt of the plane will be the parameter estimate for d1 and it is the contrast of residential vs. commercial
The front/back tilt of the plane will be the parameter estimate for d2 and it is the contrast of floating village vs. commercial
Statistical sidebar
Any full rank set of m - 1 (i.e., # levels - 1) features predicts exactly the same, regardless of the coding system (e.g., dummy, Helmert, other contrast codes)
Preference among coding systems is simply to get single df contrasts of theoretical importance (i.e., for explanation rather than prediction)
The final (mth) dummy feature is not included b/c it is completely redundant (perfectly multicollinear) with the other dummy features. Including it would also prevent a linear model from fitting (the 'dummy variable trap').
However, some statistical algorithms do not have problems with perfect multicollinearity (e.g., LASSO, ridge regression).
For these algorithms, you will sometimes see a modified version of dummy coding called one-hot coding.
This approach uses one additional dummy coded feature for the final category.
We won’t spend time on this but you should be familiar with the term b/c it is often confused with dummy coding.
Coding Sidebar
When creating dummy coded features from factors that have levels with infrequent observations, you may occasionally end up with novel levels in your validation or test sets that were not present in your training set.
This will cause you issues.
These issues are mostly resolved if you make sure to explicitly list all possible levels for a factor when classing that factor in the training data, even if the level doesn’t exist in the training data.
We provide more detail on this issue in an appendix.
3.5.2 Nominal Predictors
Now that we understand how to use dummy coding to feature engineer nominal predictors, let’s consider some potentially important ones that are available to us.
Warning: Groups with fewer than two datapoints have been dropped.
ℹ Set `drop = FALSE` to consider such groups for position adjustment purposes.
We might:
Represent it with 6 dummy features (because there are 7 raw levels) but many of the categories are very low n - won’t account for much variance?
Combine all the commercial categories (agri, commer, indus), which would take care of most of the low n groups. They also all tend to have the lower prices.
Combine all the residential to get a better feature to variance accounted ratio. They all tend to have similar prices on average and res_high is also pretty low n.
Data dictionary entry: Identifies the general zoning classification of the sale.
There is not much difference in median sale_price among categories
Not very promising
Data dictionary entry: Type of dwelling
one_fam: Single-family Detached
two_fam: Two-family Conversion; originally built as one-family dwelling
duplex: Duplex
town_end: Townhouse End Unit
town_inside: Townhouse Inside Unit
Let’s do some feature engineering with ms_zoning. We can now do this formally in a recipe so that it can be used in our modeling workflow.
First, as you may have noticed earlier, some levels of ms_zoning are pretty infrequent. Let's make sure both data_trn and data_val have all levels set for this factor.
[Note: Ideally, you would go back to cleaning EDA and add this level to the full dataset and then re-split into training, validation and test. This is a sloppy shortcut!]
With that fixed, let’s proceed:
We will collapse categories down to three levels (commercial, residential, floating village) as before but now using step_mutate() combined with fct_collapse() to do this inside of our recipe.
We will convert to dummy features using step_dummy(). The first level of the factor will be set to the reference level when we call step_dummy().
step_dummy() is a poor choice for a function name. It actually uses whatever contrast coding we have set up in R. However, the default is dummy coded contrasts (R calls these treatment contrasts). See ?contrasts and options("contrasts") for more info.
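Putting these two steps into a recipe might look something like this (the level names inside fct_collapse() are again assumptions, as is the imputation step):
Code
rec <- recipe(sale_price ~ gr_liv_area + lot_area + year_built + garage_cars + ms_zoning,
              data = data_trn) |>
  step_impute_median(garage_cars) |>                              # assumed, as above
  step_mutate(ms_zoning = fct_collapse(ms_zoning,
                commercial  = c("agri", "commer", "indus"),       # assumed level names
                residential = c("res_low", "res_med", "res_high"),
                floating    = "float")) |>
  step_dummy(ms_zoning)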
You should also read more about some other step_() functions that you might use for categorical predictors:
step_other() to combine all low frequency categories into a single "other" category
step_unknown() to assign missing values their own category
You can also use selector functions. For example, you could make dummy variables out of all of your factors in one step using step_dummy(all_nominal_predictors()).
# A tibble: 4 × 2
model rmse_val
<chr> <dbl>
1 simple linear model 48671.
2 4 feature linear model 37734.
3 4 feature linear model with YJ 39270.
4 6 feature linear model w/ms_zoning 37630.
Removing Yeo Johnson transformation but adding dummy coded ms_zoning may have helped a little
Question
Will the addition of new predictors/features to a model always reduce RMSE in train? in validation?
As you know, the estimation procedure in linear models is OLS. Parameter estimates are derived to minimize the SSE in the data set in which they are derived. For this reason, adding a predictor will never increase RMSE in the training set and it will usually lower it even when it is not part of the DGP.
However, this is not true in validation. A predictor will only meaningfully lower RMSE in validation if it is part of the DGP. Also, a bad predictor could even increase RMSE in validation due to overfitting.
3.5.3 Ordinal Predictors
We have two paths to pursue for ordinal predictors
We can treat them like nominal predictors (e.g., dummy code)
We can treat them like numeric predictors (either raw or with an added transformation if needed)
Low frequency for low and to some degree high quality response options. If dummy coding, may want to collapse some (1-2)
There is a monotonic relationship (mostly linear) with sale_price. Treat as numeric?
Not skewed so doesn’t likely need to be transformed if treated as numeric
Numeric will take one feature vs. many (9?) features for dummy codes.
Dummy codes are more flexible but we may not need this flexibility (and unnecessary flexibility increases overfitting)
Let’s add overall_qual to our model as numeric
Remember that this predictor was ordinal so we paid special attention to the order of the levels when we classed this factor. Let's confirm they are in order
Code
data_trn |> pull(overall_qual) |> levels()
[1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10"
To convert overall_qual to numeric (with levels in the specified order), we can use another simple mutate inside our recipe.
There is a step function called step_ordinalscore() but it requires that the factor is classed as an ordered factor. It is also more complicated than needed in our opinion. Just use as.numeric()
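A minimal sketch of this conversion inside a recipe; the recipe here includes only overall_qual for illustration (in the 7-feature model this step_mutate() would be added to the full recipe):
Code
rec_ord <- recipe(sale_price ~ overall_qual, data = data_trn) |>
  step_mutate(overall_qual = as.numeric(overall_qual))  # level order is already correct

rec_ord |>
  prep(training = data_trn) |>
  bake(new_data = NULL) |>
  head()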
Let’s evaluate this model
Making features
Skipping the skim to save space (we promised we checked it previously!)
# A tibble: 5 × 2
model rmse_val
<chr> <dbl>
1 simple linear model 48671.
2 4 feature linear model 37734.
3 4 feature linear model with YJ 39270.
4 6 feature linear model w/ms_zoning 37630.
5 7 feature linear model 32447.
That helped!
3.6 Extensions to Interactive Models and Non-linear Models
3.6.1 Interactions
There may be interactive effects among our predictors
Some statistical algorithms (e.g., KNN) can naturally accommodate interactive effects without any feature engineering
Linear models cannot
Nothing to fear, tidymodels makes it easy to feature engineer interactions
[BUT - as we will learn, we generally think that if you expect lots of interactions, the linear model may not be the best model to use]
For example, it may be that the relationship between year_built and sale_price depends on overall_qual.
Old houses are expensive if they are in good condition
but old houses are very cheap if they are in poor condition
In the tidymodels framework
Coding interactions is done by feature engineering, not by formula (Note that formula does not change below in recipe)
This seems appropriate to us as we are making new features to represent interactions
We still use an R formula-like interface to specify the interaction term features that will be created (see the sketch below)
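Here is a sketch of what this looks like with step_interact(), shown with only the two predictors involved for brevity (the interaction matches the year_built by overall_qual example above; the recipe object name is illustrative):
Code
rec_int <- recipe(sale_price ~ year_built + overall_qual, data = data_trn) |>
  step_mutate(overall_qual = as.numeric(overall_qual)) |>  # ordinal treated as numeric, as above
  step_interact(~ year_built:overall_qual)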
# A tibble: 6 × 2
model rmse_val
<chr> <dbl>
1 simple linear model 48671.
2 4 feature linear model 37734.
3 4 feature linear model with YJ 39270.
4 6 feature linear model w/ms_zoning 37630.
5 7 feature linear model 32447.
6 8 feature linear model w/interaction 31768.
That helped!
You can also feature engineer interactions with nominal (and ordinal predictors treated as nominal) predictors
The nominal predictors should first be converted to dummy code features
You will indicate the interactions using the variable names that will be assigned to these dummy code features
Use starts_with() or matches() to make it easy if there are many features associated with a categorical predictor
Can use “~ .^2” to include all two way interactions (be careful if you have dummy coded features!)
Let’s code an interaction between ms_zoning & year_built.
Old homes are cool
Old commercial spaces are never cool
Maybe this is why the main effect of ms_zoning wasn’t useful
error_val <- error_val |>
  bind_rows(tibble(model = "10 feature linear model w/interactions",
                   rmse_val = rmse_vec(feat_val$sale_price,
                                       predict(fit_lm_10, feat_val)$.pred)))
error_val
# A tibble: 7 × 2
model rmse_val
<chr> <dbl>
1 simple linear model 48671.
2 4 feature linear model 37734.
3 4 feature linear model with YJ 39270.
4 6 feature linear model w/ms_zoning 37630.
5 7 feature linear model 32447.
6 8 feature linear model w/interaction 31768.
7 10 feature linear model w/interactions 31717.
Not really any better
Shouldn’t just include all interactions without reason
Either you have done EDA to support them or
You have substantive interest in them (explanatory question)
If you want all interactions, use a statistical algorithm that supports those relationships without feature engineering (e.g., KNN, random forest and other decision trees)
3.6.2 Non-linear Models
We may also want to model non-linear effects of our predictors
Some non-parametric models can accommodate non-linear effects without feature engineering (e.g., KNN, Random Forest).
Non-linear effects can be accommodated in a linear model with feature engineering
Ordinal predictors can be coded with dummy variables
Numeric predictors can be split at a threshold
Polynomial contrasts for numeric or ordinal predictors (see step_poly())
We will continue to explore these options throughout the course
3.7 KNN Regression
K Nearest Neighbor
Is a non-parametric regression and classification statistical algorithm
It does not yield specific parameter estimates for features/predictors (or statistical tests for those parameter estimates)
There are still ways to use it to address explanatory questions (visualizations, model comparisons, feature importance)
Very simple but also powerful (listed commonly among top 10 algorithms)
By powerful, we mean it is quite flexible and can accommodate many varied DGPs without the need for much feature engineering of its predictors
May not need most transformations of X or Y
May not need to model interactions
Still need to handle missing data, outliers, and categorical predictors
K Nearest Neighbor
Algorithm “memorizes” the training set (lazy learning)
Lazy learning is most useful for large, continuously changing datasets with few attributes (features) that are commonly queried (e.g., online recommendation systems)
Prediction for any new observation is based on \(k\) most similar observations from the dataset
\(k\) provides direct control over the bias-variance trade-off for this algorithm
To better understand KNN let’s simulate training data for three different DGPs (linear - y, polynomial - y2, and step - y3)
Let’s start with a simple example where the DGP for Y is linear on one predictor (X)
DGP: \(y = rnorm(150, x, 10)\)
This figure displays:
DGP
Prediction line from a simple linear model
Red lines to represent three new observations (X = 10, 50, 90) we want to make predictions for via a standard KNN
Question
What would 5-NN predictions look like for each of these three new values of X in the figure above?
For x = 10, find the five observations that have X values closest to 10. Average the Y values for those 5 observations and that is your predicted Y associated with that new value of X.
Repeat to make predictions for Y at any other value of X, e.g., 50 or 90
KNN can easily accommodate non-linear relationships between numeric predictors and outcomes without any feature engineering for predictors
In fact, it can flexibly handle any shape of relationship
DGP: \(y2 = rnorm(150, x^4 / 800000, 8)\)
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
fit_5nn_demo <-
  nearest_neighbor() |>
  set_engine("kknn") |>
  set_mode("regression") |>
  fit(y2 ~ ., data = feat_trn_demo)
Get features for a validation set (a new sample using same polynomial DGP)
Code
feat_val_demo <- rec_prep |> bake(new_data = data_val_demo)
Display 5NN predictions in validation
KNN (with k = 5) does a pretty good job of representing the shape of the DGP (low bias)
KNN displays some (but minimal) evidence of overfitting
Simple linear model does not perform well (clear/high bias)
Let’s pause and consider our conceptual understanding of the impact of \(k\) on the bias-variance trade-off
Question
How will the size of k influence model performance (e.g., bias, overfitting/variance)?
Smaller values of k will tend to increase overfitting (and therefore variance across training samples) but decrease bias. Larger values of k will tend to decrease overfitting but increase bias. We need to find the Goldilocks “sweet spot”
Question
How will k = 1 perform in training and validation sets?
k = 1 will perfectly fit the training set. Therefore it is very dependent on the training set (high variance). It will fit both the DGP and the noise in the training set. Clearly it will likely not do as well in validation (it will be overfit to training).
k needs to be larger if there is more noise (to average over more cases). k needs to be smaller if the relationships are complex. (More on choosing k by resampling in unit 5.)
To make a prediction for some new observation, we need to identify the observations from the training set that are nearest to it
Need a distance measure to define “nearest”
IMPORTANT: We care only about:
Distance between a validation observation and all the training observations
Need to find the \(k\) observations in training that are nearest to the validation observation (i.e., its neighbors)
Distance is defined based on these observations’ features, not their outcomes
There are a number of different distance measures available (e.g., Euclidean, Manhattan, Chebyshev, Cosine, Minkowski)
Euclidean is most commonly used in KNN
Euclidean distance between any two points is an n-dimensional extension of the Pythagorean formula (which applies explicitly with 2 features/2 dimensional space).
\(C^2 = A^2 + B^2\)
\(C = \sqrt{A^2 + B^2}\)
…where C is the distance between two points
The Euclidean distance between 2 points (p and q) in two dimensions (2 predictors, x1 and x2), where A is now the difference between the points on x1 and B is the difference on x2:
\(Distance = \sqrt{A^2 + B^2}\)
\(Distance = \sqrt{(q_1 - p_1)^2 + (q_2 - p_2)^2}\)
For example, with p = (1, 2) and q = (2, 5):
\(Distance = \sqrt{(2 - 1)^2 + (5 - 2)^2}\)
\(Distance = 3.2\)
In one dimension (one feature), the distance is simply the absolute difference between p and q on that feature (x1)
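A quick numeric check of the two-dimensional example above:
Code
p <- c(1, 2)
q <- c(2, 5)
sqrt(sum((q - p)^2))  # sqrt(10) = 3.162, which rounds to 3.2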
# A tibble: 8 × 2
model rmse_val
<chr> <dbl>
1 simple linear model 48671.
2 4 feature linear model 37734.
3 4 feature linear model with YJ 39270.
4 6 feature linear model w/ms_zoning 37630.
5 7 feature linear model 32447.
6 8 feature linear model w/interaction 31768.
7 10 feature linear model w/interactions 31717.
8 5 numeric predictor 5nn 31096.
Not bad!
KNN also mostly solved the linearity problem
We might be able to improve the linear models with better transformations of X and Y
# A tibble: 9 × 2
model rmse_val
<chr> <dbl>
1 simple linear model 48671.
2 4 feature linear model 37734.
3 4 feature linear model with YJ 39270.
4 6 feature linear model w/ms_zoning 37630.
5 7 feature linear model 32447.
6 8 feature linear model w/interaction 31768.
7 10 feature linear model w/interactions 31717.
8 5 numeric predictor 5nn 31096.
9 5 numeric predictor 20nn 28324.
That helped some
One more time with k = 50 to see where we are in the bias-variance function
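For reference, a sketch of how \(k\) is set when specifying the model (the object name is illustrative; the feature matrix is the same one used for the other KNN models):
Code
fit_50nn <-
  nearest_neighbor(neighbors = 50) |>
  set_engine("kknn") |>
  set_mode("regression") |>
  fit(sale_price ~ ., data = feat_trn)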
# A tibble: 10 × 2
model rmse_val
<chr> <dbl>
1 simple linear model 48671.
2 4 feature linear model 37734.
3 4 feature linear model with YJ 39270.
4 6 feature linear model w/ms_zoning 37630.
5 7 feature linear model 32447.
6 8 feature linear model w/interaction 31768.
7 10 feature linear model w/interactions 31717.
8 5 numeric predictor 5nn 31096.
9 5 numeric predictor 20nn 28324.
10 5 numeric predictor 50nn 29082.
Too high; now we have bias.
We will learn a more rigorous method for selecting the optimal value for \(k\) (i.e., tuning this hyperparameter) in unit 5
To better understand bias-variance trade-off, let’s look at error across these three values of \(k\) in train and validation for Ames
Training
Remember that training error would be 0 for k = 1
Training error is increasing as \(k\) increases b/c KNN is overfitting less (so it's not fitting the noise in train as well)
# A tibble: 11 × 2
model rmse_val
<chr> <dbl>
1 simple linear model 48671.
2 4 feature linear model 37734.
3 4 feature linear model with YJ 39270.
4 6 feature linear model w/ms_zoning 37630.
5 7 feature linear model 32447.
6 8 feature linear model w/interaction 31768.
7 10 feature linear model w/interactions 31717.
8 5 numeric predictor 5nn 31096.
9 5 numeric predictor 20nn 28324.
10 5 numeric predictor 50nn 29082.
11 5 numeric predictor 20nn with ms_zoning 27924.
Now it helps.
Might have to do with interactions with other predictors that we didn’t model in the linear model
KNN automatically accommodates interactions. Why?
This model is a bit more complex and might benefit further from higher \(k\)
As a teaser, here is another performance metric for this model - \(R^2\). Not too shabby! Remember, there is certainly some irreducible error in sale_price that will put a ceiling on \(R^2\) and a floor on RMSE
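A sketch of how that metric might be computed with yardstick (the fitted model object name fit_20nn is illustrative):
Code
rsq_vec(truth = feat_val$sale_price,
        estimate = predict(fit_20nn, feat_val)$.pred)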
Overall, we now have a model that predicts housing prices with about 30K of RMSE and accounting for 84% of the variance. I am sure you can improve on this!