Unit 09 Lab Agenda

Author

Coco Yu

Published

April 1, 2025

Package Conflicts

  • Why do I constantly see conflict errors when resampling, especially between the tidyverse and Matrix packages? The only workaround I have found is to load the Matrix package manually and exclude the functions that conflict with those in the tidyverse.
library(Matrix, exclude = c("expand", "pack", "unpack"))
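
To see which names actually collide, one option is to compare the exports of the two packages. This is only a sketch: `shared_exports()` is a hypothetical helper name (not part of any package), and the Matrix/tidyr comparison runs only if both packages are installed.

```r
# Hypothetical helper: names exported by both packages
shared_exports <- function(pkg_a, pkg_b) {
  sort(intersect(getNamespaceExports(pkg_a), getNamespaceExports(pkg_b)))
}

# Names that Matrix and tidyr both export (skipped if either is missing)
if (requireNamespace("Matrix", quietly = TRUE) &&
    requireNamespace("tidyr", quietly = TRUE)) {
  print(shared_exports("Matrix", "tidyr"))
}

# Calling a function with an explicit namespace sidesteps masking,
# whatever order the packages were attached in
smoothed <- stats::filter(1:5, rep(1 / 3, 3))
smoothed
```

Prefixing the few conflicting calls with `pkg::` is an alternative to `exclude =` when only a handful of functions are affected.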

Methods to speed up EDA (skimr) when dealing with large-scale data

library(skimr)

ames <- ames |> janitor::clean_names()
ames |> skim()
Data summary
Name ames
Number of rows 2930
Number of columns 74
_______________________
Column type frequency:
factor 40
numeric 34
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
ms_sub_class 0 1 FALSE 16 One: 1079, Two: 575, One: 287, One: 192
ms_zoning 0 1 FALSE 7 Res: 2273, Res: 462, Flo: 139, Res: 27
street 0 1 FALSE 2 Pav: 2918, Grv: 12
alley 0 1 FALSE 3 No_: 2732, Gra: 120, Pav: 78
lot_shape 0 1 FALSE 4 Reg: 1859, Sli: 979, Mod: 76, Irr: 16
land_contour 0 1 FALSE 4 Lvl: 2633, HLS: 120, Bnk: 117, Low: 60
utilities 0 1 FALSE 3 All: 2927, NoS: 2, NoS: 1
lot_config 0 1 FALSE 5 Ins: 2140, Cor: 511, Cul: 180, FR2: 85
land_slope 0 1 FALSE 3 Gtl: 2789, Mod: 125, Sev: 16
neighborhood 0 1 FALSE 28 Nor: 443, Col: 267, Old: 239, Edw: 194
condition_1 0 1 FALSE 9 Nor: 2522, Fee: 164, Art: 92, RRA: 50
condition_2 0 1 FALSE 8 Nor: 2900, Fee: 13, Art: 5, Pos: 4
bldg_type 0 1 FALSE 5 One: 2425, Twn: 233, Dup: 109, Twn: 101
house_style 0 1 FALSE 8 One: 1481, Two: 873, One: 314, SLv: 128
overall_cond 0 1 FALSE 9 Ave: 1654, Abo: 533, Goo: 390, Ver: 144
roof_style 0 1 FALSE 6 Gab: 2321, Hip: 551, Gam: 22, Fla: 20
roof_matl 0 1 FALSE 8 Com: 2887, Tar: 23, WdS: 9, WdS: 7
exterior_1st 0 1 FALSE 16 Vin: 1026, Met: 450, HdB: 442, Wd : 420
exterior_2nd 0 1 FALSE 17 Vin: 1015, Met: 447, HdB: 406, Wd : 397
mas_vnr_type 0 1 FALSE 5 Non: 1775, Brk: 880, Sto: 249, Brk: 25
exter_cond 0 1 FALSE 5 Typ: 2549, Goo: 299, Fai: 67, Exc: 12
foundation 0 1 FALSE 6 PCo: 1310, CBl: 1244, Brk: 311, Sla: 49
bsmt_cond 0 1 FALSE 6 Typ: 2616, Goo: 122, Fai: 104, No_: 80
bsmt_exposure 0 1 FALSE 5 No: 1906, Av: 418, Gd: 284, Mn: 239
bsmt_fin_type_1 0 1 FALSE 7 GLQ: 859, Unf: 851, ALQ: 429, Rec: 288
bsmt_fin_type_2 0 1 FALSE 7 Unf: 2499, Rec: 106, LwQ: 89, No_: 81
heating 0 1 FALSE 6 Gas: 2885, Gas: 27, Gra: 9, Wal: 6
heating_qc 0 1 FALSE 5 Exc: 1495, Typ: 864, Goo: 476, Fai: 92
central_air 0 1 FALSE 2 Y: 2734, N: 196
electrical 0 1 FALSE 6 SBr: 2682, Fus: 188, Fus: 50, Fus: 8
functional 0 1 FALSE 8 Typ: 2728, Min: 70, Min: 65, Mod: 35
garage_type 0 1 FALSE 7 Att: 1731, Det: 782, Bui: 186, No_: 157
garage_finish 0 1 FALSE 4 Unf: 1231, RFn: 812, Fin: 728, No_: 159
garage_cond 0 1 FALSE 6 Typ: 2665, No_: 159, Fai: 74, Goo: 15
paved_drive 0 1 FALSE 3 Pav: 2652, Dir: 216, Par: 62
pool_qc 0 1 FALSE 5 No_: 2917, Exc: 4, Goo: 4, Typ: 3
fence 0 1 FALSE 5 No_: 2358, Min: 330, Goo: 118, Goo: 112
misc_feature 0 1 FALSE 6 Non: 2824, She: 95, Gar: 5, Oth: 4
sale_type 0 1 FALSE 10 WD : 2536, New: 239, COD: 87, Con: 26
sale_condition 0 1 FALSE 6 Nor: 2413, Par: 245, Abn: 190, Fam: 46

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
lot_frontage 0 1 57.65 33.50 0.00 43.00 63.00 78.00 313.00 ▇▇▁▁▁
lot_area 0 1 10147.92 7880.02 1300.00 7440.25 9436.50 11555.25 215245.00 ▇▁▁▁▁
year_built 0 1 1971.36 30.25 1872.00 1954.00 1973.00 2001.00 2010.00 ▁▂▃▆▇
year_remod_add 0 1 1984.27 20.86 1950.00 1965.00 1993.00 2004.00 2010.00 ▅▂▂▃▇
mas_vnr_area 0 1 101.10 178.63 0.00 0.00 0.00 162.75 1600.00 ▇▁▁▁▁
bsmt_fin_sf_1 0 1 4.18 2.23 0.00 3.00 3.00 7.00 7.00 ▃▂▇▁▇
bsmt_fin_sf_2 0 1 49.71 169.14 0.00 0.00 0.00 0.00 1526.00 ▇▁▁▁▁
bsmt_unf_sf 0 1 559.07 439.54 0.00 219.00 465.50 801.75 2336.00 ▇▅▂▁▁
total_bsmt_sf 0 1 1051.26 440.97 0.00 793.00 990.00 1301.50 6110.00 ▇▃▁▁▁
first_flr_sf 0 1 1159.56 391.89 334.00 876.25 1084.00 1384.00 5095.00 ▇▃▁▁▁
second_flr_sf 0 1 335.46 428.40 0.00 0.00 0.00 703.75 2065.00 ▇▃▂▁▁
gr_liv_area 0 1 1499.69 505.51 334.00 1126.00 1442.00 1742.75 5642.00 ▇▇▁▁▁
bsmt_full_bath 0 1 0.43 0.52 0.00 0.00 0.00 1.00 3.00 ▇▆▁▁▁
bsmt_half_bath 0 1 0.06 0.25 0.00 0.00 0.00 0.00 2.00 ▇▁▁▁▁
full_bath 0 1 1.57 0.55 0.00 1.00 2.00 2.00 4.00 ▁▇▇▁▁
half_bath 0 1 0.38 0.50 0.00 0.00 0.00 1.00 2.00 ▇▁▅▁▁
bedroom_abv_gr 0 1 2.85 0.83 0.00 2.00 3.00 3.00 8.00 ▁▇▂▁▁
kitchen_abv_gr 0 1 1.04 0.21 0.00 1.00 1.00 1.00 3.00 ▁▇▁▁▁
tot_rms_abv_grd 0 1 6.44 1.57 2.00 5.00 6.00 7.00 15.00 ▁▇▂▁▁
fireplaces 0 1 0.60 0.65 0.00 0.00 1.00 1.00 4.00 ▇▇▁▁▁
garage_cars 0 1 1.77 0.76 0.00 1.00 2.00 2.00 5.00 ▅▇▂▁▁
garage_area 0 1 472.66 215.19 0.00 320.00 480.00 576.00 1488.00 ▃▇▃▁▁
wood_deck_sf 0 1 93.75 126.36 0.00 0.00 0.00 168.00 1424.00 ▇▁▁▁▁
open_porch_sf 0 1 47.53 67.48 0.00 0.00 27.00 70.00 742.00 ▇▁▁▁▁
enclosed_porch 0 1 23.01 64.14 0.00 0.00 0.00 0.00 1012.00 ▇▁▁▁▁
three_season_porch 0 1 2.59 25.14 0.00 0.00 0.00 0.00 508.00 ▇▁▁▁▁
screen_porch 0 1 16.00 56.09 0.00 0.00 0.00 0.00 576.00 ▇▁▁▁▁
pool_area 0 1 2.24 35.60 0.00 0.00 0.00 0.00 800.00 ▇▁▁▁▁
misc_val 0 1 50.64 566.34 0.00 0.00 0.00 0.00 17000.00 ▇▁▁▁▁
mo_sold 0 1 6.22 2.71 1.00 4.00 6.00 8.00 12.00 ▅▆▇▃▃
year_sold 0 1 2007.79 1.32 2006.00 2007.00 2008.00 2009.00 2010.00 ▇▇▇▇▃
sale_price 0 1 180796.06 79886.69 12789.00 129500.00 160000.00 213500.00 755000.00 ▇▇▁▁▁
longitude 0 1 -93.64 0.03 -93.69 -93.66 -93.64 -93.62 -93.58 ▅▅▇▆▁
latitude 0 1 42.03 0.02 41.99 42.02 42.03 42.05 42.06 ▂▂▇▇▇
# select a subset of summary statistics
my_skim <- skim_with(numeric = sfl(hist = NULL, p0 = NULL, p25 = NULL,
                                   p75 = NULL, p100 = NULL))
ames |> my_skim()
Data summary
Name ames
Number of rows 2930
Number of columns 74
_______________________
Column type frequency:
factor 40
numeric 34
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
ms_sub_class 0 1 FALSE 16 One: 1079, Two: 575, One: 287, One: 192
ms_zoning 0 1 FALSE 7 Res: 2273, Res: 462, Flo: 139, Res: 27
street 0 1 FALSE 2 Pav: 2918, Grv: 12
alley 0 1 FALSE 3 No_: 2732, Gra: 120, Pav: 78
lot_shape 0 1 FALSE 4 Reg: 1859, Sli: 979, Mod: 76, Irr: 16
land_contour 0 1 FALSE 4 Lvl: 2633, HLS: 120, Bnk: 117, Low: 60
utilities 0 1 FALSE 3 All: 2927, NoS: 2, NoS: 1
lot_config 0 1 FALSE 5 Ins: 2140, Cor: 511, Cul: 180, FR2: 85
land_slope 0 1 FALSE 3 Gtl: 2789, Mod: 125, Sev: 16
neighborhood 0 1 FALSE 28 Nor: 443, Col: 267, Old: 239, Edw: 194
condition_1 0 1 FALSE 9 Nor: 2522, Fee: 164, Art: 92, RRA: 50
condition_2 0 1 FALSE 8 Nor: 2900, Fee: 13, Art: 5, Pos: 4
bldg_type 0 1 FALSE 5 One: 2425, Twn: 233, Dup: 109, Twn: 101
house_style 0 1 FALSE 8 One: 1481, Two: 873, One: 314, SLv: 128
overall_cond 0 1 FALSE 9 Ave: 1654, Abo: 533, Goo: 390, Ver: 144
roof_style 0 1 FALSE 6 Gab: 2321, Hip: 551, Gam: 22, Fla: 20
roof_matl 0 1 FALSE 8 Com: 2887, Tar: 23, WdS: 9, WdS: 7
exterior_1st 0 1 FALSE 16 Vin: 1026, Met: 450, HdB: 442, Wd : 420
exterior_2nd 0 1 FALSE 17 Vin: 1015, Met: 447, HdB: 406, Wd : 397
mas_vnr_type 0 1 FALSE 5 Non: 1775, Brk: 880, Sto: 249, Brk: 25
exter_cond 0 1 FALSE 5 Typ: 2549, Goo: 299, Fai: 67, Exc: 12
foundation 0 1 FALSE 6 PCo: 1310, CBl: 1244, Brk: 311, Sla: 49
bsmt_cond 0 1 FALSE 6 Typ: 2616, Goo: 122, Fai: 104, No_: 80
bsmt_exposure 0 1 FALSE 5 No: 1906, Av: 418, Gd: 284, Mn: 239
bsmt_fin_type_1 0 1 FALSE 7 GLQ: 859, Unf: 851, ALQ: 429, Rec: 288
bsmt_fin_type_2 0 1 FALSE 7 Unf: 2499, Rec: 106, LwQ: 89, No_: 81
heating 0 1 FALSE 6 Gas: 2885, Gas: 27, Gra: 9, Wal: 6
heating_qc 0 1 FALSE 5 Exc: 1495, Typ: 864, Goo: 476, Fai: 92
central_air 0 1 FALSE 2 Y: 2734, N: 196
electrical 0 1 FALSE 6 SBr: 2682, Fus: 188, Fus: 50, Fus: 8
functional 0 1 FALSE 8 Typ: 2728, Min: 70, Min: 65, Mod: 35
garage_type 0 1 FALSE 7 Att: 1731, Det: 782, Bui: 186, No_: 157
garage_finish 0 1 FALSE 4 Unf: 1231, RFn: 812, Fin: 728, No_: 159
garage_cond 0 1 FALSE 6 Typ: 2665, No_: 159, Fai: 74, Goo: 15
paved_drive 0 1 FALSE 3 Pav: 2652, Dir: 216, Par: 62
pool_qc 0 1 FALSE 5 No_: 2917, Exc: 4, Goo: 4, Typ: 3
fence 0 1 FALSE 5 No_: 2358, Min: 330, Goo: 118, Goo: 112
misc_feature 0 1 FALSE 6 Non: 2824, She: 95, Gar: 5, Oth: 4
sale_type 0 1 FALSE 10 WD : 2536, New: 239, COD: 87, Con: 26
sale_condition 0 1 FALSE 6 Nor: 2413, Par: 245, Abn: 190, Fam: 46

Variable type: numeric

skim_variable n_missing complete_rate mean sd p50
lot_frontage 0 1 57.65 33.50 63.00
lot_area 0 1 10147.92 7880.02 9436.50
year_built 0 1 1971.36 30.25 1973.00
year_remod_add 0 1 1984.27 20.86 1993.00
mas_vnr_area 0 1 101.10 178.63 0.00
bsmt_fin_sf_1 0 1 4.18 2.23 3.00
bsmt_fin_sf_2 0 1 49.71 169.14 0.00
bsmt_unf_sf 0 1 559.07 439.54 465.50
total_bsmt_sf 0 1 1051.26 440.97 990.00
first_flr_sf 0 1 1159.56 391.89 1084.00
second_flr_sf 0 1 335.46 428.40 0.00
gr_liv_area 0 1 1499.69 505.51 1442.00
bsmt_full_bath 0 1 0.43 0.52 0.00
bsmt_half_bath 0 1 0.06 0.25 0.00
full_bath 0 1 1.57 0.55 2.00
half_bath 0 1 0.38 0.50 0.00
bedroom_abv_gr 0 1 2.85 0.83 3.00
kitchen_abv_gr 0 1 1.04 0.21 1.00
tot_rms_abv_grd 0 1 6.44 1.57 6.00
fireplaces 0 1 0.60 0.65 1.00
garage_cars 0 1 1.77 0.76 2.00
garage_area 0 1 472.66 215.19 480.00
wood_deck_sf 0 1 93.75 126.36 0.00
open_porch_sf 0 1 47.53 67.48 27.00
enclosed_porch 0 1 23.01 64.14 0.00
three_season_porch 0 1 2.59 25.14 0.00
screen_porch 0 1 16.00 56.09 0.00
pool_area 0 1 2.24 35.60 0.00
misc_val 0 1 50.64 566.34 0.00
mo_sold 0 1 6.22 2.71 6.00
year_sold 0 1 2007.79 1.32 2008.00
sale_price 0 1 180796.06 79886.69 160000.00
longitude 0 1 -93.64 0.03 -93.64
latitude 0 1 42.03 0.02 42.03
# compute a single statistic (the median) for the numeric columns only
ames |> summarize(across(where(is.numeric), ~median(.x, na.rm = TRUE)))
# A tibble: 1 × 34
  lot_frontage lot_area year_built year_remod_add mas_vnr_area bsmt_fin_sf_1
         <dbl>    <dbl>      <dbl>          <dbl>        <dbl>         <dbl>
1           63    9436.       1973           1993            0             3
  bsmt_fin_sf_2 bsmt_unf_sf total_bsmt_sf first_flr_sf second_flr_sf gr_liv_area
          <dbl>       <dbl>         <dbl>        <dbl>         <dbl>       <dbl>
1             0        466.           990         1084             0        1442
  bsmt_full_bath bsmt_half_bath full_bath half_bath bedroom_abv_gr
           <dbl>          <dbl>     <dbl>     <dbl>          <dbl>
1              0              0         2         0              3
  kitchen_abv_gr tot_rms_abv_grd fireplaces garage_cars garage_area wood_deck_sf
           <dbl>           <dbl>      <dbl>       <dbl>       <dbl>        <dbl>
1              1               6          1           2         480            0
  open_porch_sf enclosed_porch three_season_porch screen_porch pool_area
          <dbl>          <dbl>              <dbl>        <dbl>     <dbl>
1            27              0                  0            0         0
  misc_val mo_sold year_sold sale_price longitude latitude
     <dbl>   <dbl>     <dbl>      <dbl>     <dbl>    <dbl>
1        0       6      2008     160000     -93.6     42.0
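
Two further options that may help with larger data are skimming a random sample of rows and skimming only the columns of interest. The sketch below uses simulated data; the data frame `big` and its column names are made up for illustration.

```r
library(dplyr)
library(skimr)

set.seed(123)
n <- 1e5
big <- tibble(
  x = rnorm(n),
  y = runif(n),
  g = factor(sample(letters[1:5], n, replace = TRUE))
)

# Option 1: skim a random subset of rows instead of the full data
big_sample <- big |> slice_sample(n = 10000)
big_sample |> skim()

# Option 2: skim only the columns you need (tidyselect syntax)
big |> skim(x, g)
```

A random sample gives approximate summaries, so it is better suited to a first look than to final reporting.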

Graph Interpretation

Example decision tree graphs with different tree depths (tree_depth)

set.seed(123)
# split once so the training and test sets come from the same partition
splits <- ames |> 
  initial_split(prop = 3/4, strata = "sale_price", breaks = 4)
data_trn <- splits |> analysis()
data_test <- splits |> assessment()

rec <- recipe(sale_price ~ ., data = data_trn)
rec_prep <- rec |> 
  prep(data_trn)
feat_trn <- rec_prep |> bake(NULL)
feat_test <- rec_prep |> bake(data_test)
feat_trn |> skim_some()
Data summary
Name feat_trn
Number of rows 2197
Number of columns 74
_______________________
Column type frequency:
factor 40
numeric 34
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
ms_sub_class 0 1 FALSE 16 One: 796, Two: 436, One: 217, One: 154
ms_zoning 0 1 FALSE 7 Res: 1695, Res: 350, Flo: 109, C_a: 21
street 0 1 FALSE 2 Pav: 2186, Grv: 11
alley 0 1 FALSE 3 No_: 2043, Gra: 90, Pav: 64
lot_shape 0 1 FALSE 4 Reg: 1377, Sli: 755, Mod: 56, Irr: 9
land_contour 0 1 FALSE 4 Lvl: 1961, Bnk: 98, HLS: 94, Low: 44
utilities 0 1 FALSE 3 All: 2194, NoS: 2, NoS: 1
lot_config 0 1 FALSE 5 Ins: 1609, Cor: 380, Cul: 129, FR2: 67
land_slope 0 1 FALSE 3 Gtl: 2087, Mod: 100, Sev: 10
neighborhood 0 1 FALSE 28 Nor: 326, Col: 204, Old: 175, Edw: 146
condition_1 0 1 FALSE 9 Nor: 1881, Fee: 127, Art: 71, RRA: 37
condition_2 0 1 FALSE 7 Nor: 2175, Fee: 10, Art: 4, Pos: 3
bldg_type 0 1 FALSE 5 One: 1802, Twn: 187, Dup: 84, Twn: 77
house_style 0 1 FALSE 8 One: 1109, Two: 661, One: 231, SLv: 98
overall_cond 0 1 FALSE 9 Ave: 1257, Abo: 386, Goo: 288, Ver: 111
roof_style 0 1 FALSE 6 Gab: 1745, Hip: 413, Gam: 16, Fla: 12
roof_matl 0 1 FALSE 6 Com: 2168, Tar: 16, WdS: 6, WdS: 5
exterior_1st 0 1 FALSE 15 Vin: 762, HdB: 341, Met: 339, Wd : 307
exterior_2nd 0 1 FALSE 16 Vin: 758, Met: 333, HdB: 305, Wd : 290
mas_vnr_type 0 1 FALSE 5 Non: 1330, Brk: 659, Sto: 190, Brk: 17
exter_cond 0 1 FALSE 5 Typ: 1919, Goo: 219, Fai: 51, Exc: 7
foundation 0 1 FALSE 6 PCo: 991, CBl: 928, Brk: 226, Sla: 41
bsmt_cond 0 1 FALSE 6 Typ: 1961, Fai: 84, Goo: 81, No_: 64
bsmt_exposure 0 1 FALSE 5 No: 1439, Av: 312, Gd: 203, Mn: 176
bsmt_fin_type_1 0 1 FALSE 7 GLQ: 654, Unf: 651, ALQ: 315, Rec: 202
bsmt_fin_type_2 0 1 FALSE 7 Unf: 1880, Rec: 75, No_: 64, LwQ: 63
heating 0 1 FALSE 6 Gas: 2158, Gas: 23, Gra: 8, Wal: 6
heating_qc 0 1 FALSE 5 Exc: 1130, Typ: 630, Goo: 357, Fai: 77
central_air 0 1 FALSE 2 Y: 2047, N: 150
electrical 0 1 FALSE 5 SBr: 2010, Fus: 141, Fus: 40, Fus: 5
functional 0 1 FALSE 8 Typ: 2039, Min: 54, Min: 54, Mod: 24
garage_type 0 1 FALSE 7 Att: 1288, Det: 591, Bui: 138, No_: 122
garage_finish 0 1 FALSE 4 Unf: 920, RFn: 616, Fin: 537, No_: 124
garage_cond 0 1 FALSE 6 Typ: 1996, No_: 124, Fai: 52, Poo: 12
paved_drive 0 1 FALSE 3 Pav: 1985, Dir: 166, Par: 46
pool_qc 0 1 FALSE 5 No_: 2186, Goo: 4, Typ: 3, Exc: 2
fence 0 1 FALSE 5 No_: 1780, Min: 236, Goo: 87, Goo: 86
misc_feature 0 1 FALSE 6 Non: 2123, She: 65, Oth: 4, Gar: 3
sale_type 0 1 FALSE 10 WD : 1906, New: 178, COD: 64, Con: 18
sale_condition 0 1 FALSE 6 Nor: 1800, Par: 183, Abn: 148, Fam: 38

Variable type: numeric

skim_variable n_missing complete_rate p0 p100
lot_frontage 0 1 0.00 313.00
lot_area 0 1 1300.00 215245.00
year_built 0 1 1872.00 2010.00
year_remod_add 0 1 1950.00 2010.00
mas_vnr_area 0 1 0.00 1600.00
bsmt_fin_sf_1 0 1 0.00 7.00
bsmt_fin_sf_2 0 1 0.00 1393.00
bsmt_unf_sf 0 1 0.00 2336.00
total_bsmt_sf 0 1 0.00 6110.00
first_flr_sf 0 1 334.00 5095.00
second_flr_sf 0 1 0.00 1872.00
gr_liv_area 0 1 334.00 5642.00
bsmt_full_bath 0 1 0.00 3.00
bsmt_half_bath 0 1 0.00 2.00
full_bath 0 1 0.00 4.00
half_bath 0 1 0.00 2.00
bedroom_abv_gr 0 1 0.00 8.00
kitchen_abv_gr 0 1 0.00 3.00
tot_rms_abv_grd 0 1 2.00 15.00
fireplaces 0 1 0.00 4.00
garage_cars 0 1 0.00 4.00
garage_area 0 1 0.00 1488.00
wood_deck_sf 0 1 0.00 870.00
open_porch_sf 0 1 0.00 742.00
enclosed_porch 0 1 0.00 1012.00
three_season_porch 0 1 0.00 508.00
screen_porch 0 1 0.00 576.00
pool_area 0 1 0.00 800.00
misc_val 0 1 0.00 17000.00
mo_sold 0 1 1.00 12.00
year_sold 0 1 2006.00 2010.00
longitude 0 1 -93.69 -93.58
latitude 0 1 41.99 42.06
sale_price 0 1 12789.00 755000.00
# Decision Tree
fit_tree_1 <-   
  decision_tree(tree_depth = 1, min_n = 2, cost_complexity = 0) |>
  set_engine("rpart", model = TRUE) |>
  set_mode("regression") |>  
  fit(sale_price ~ ., data = feat_trn)
fit_tree_1$fit |> rpart.plot::rpart.plot()

fit_tree_2 <-   
  decision_tree(tree_depth = 2, min_n = 2, cost_complexity = 0) |>
  set_engine("rpart", model = TRUE) |>
  set_mode("regression") |>  
  fit(sale_price ~ ., data = feat_trn)
fit_tree_2$fit |> rpart.plot::rpart.plot()

fit_tree_3 <-   
  decision_tree(tree_depth = 3, min_n = 2, cost_complexity = 0) |>
  set_engine("rpart", model = TRUE) |>
  set_mode("regression") |>  
  fit(sale_price ~ ., data = feat_trn)
fit_tree_3$fit |> rpart.plot::rpart.plot()

A decision tree is only somewhat interpretable, and only when the tree structure is simple. As the tree depth increases, the tree becomes more complex and harder to interpret.
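
One way to see how quickly complexity grows is to count terminal nodes as the maximum depth increases. This is a minimal sketch that calls rpart directly on the built-in `mtcars` data rather than on ames; `count_leaves()` is a hypothetical helper name.

```r
library(rpart)

# Count terminal nodes (leaves) of a regression tree grown to a given depth
count_leaves <- function(depth) {
  fit <- rpart(mpg ~ ., data = mtcars,
               control = rpart.control(maxdepth = depth, minsplit = 2, cp = 0))
  sum(fit$frame$var == "<leaf>")
}

depths <- 1:5
leaves <- sapply(depths, count_leaves)

# A binary tree of depth d can have at most 2^d leaves
data.frame(tree_depth = depths, n_leaves = leaves, max_possible = 2^depths)
```

Each leaf corresponds to one decision rule, so the number of rules a reader has to follow can roughly double with every extra level.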

Bagged trees and random forests are inherently not interpretable. However, we will introduce a few model-agnostic approaches in later chapters to interpret these models.

Bagged Trees

tree_depth: maximum depth of the tree

min_n: minimum number of observations a node must contain for it to be split further

cost_complexity: complexity parameter for the tree. The larger the value, the simpler the tree.
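
To check how these parsnip arguments map onto the engine's own arguments, `translate()` prints the call that parsnip would generate. A small sketch (the hyperparameter values are arbitrary):

```r
library(parsnip)

spec <- decision_tree(tree_depth = 3, min_n = 20, cost_complexity = 0.01) |>
  set_engine("rpart") |>
  set_mode("regression")

# Show the rpart-level call template, including the translated argument names
translate(spec)
```

For the rpart engine, the printed template should show the three values under rpart's names (`maxdepth`, `minsplit`, and `cp`).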

evaluate_bagged_trees <- function(n_boots) {
  model <- bag_tree(mode = "regression") |> 
    set_engine("rpart", times = n_boots) |>  # Adjust number of bootstraps
    fit(sale_price ~ ., data = feat_trn)
  
  predictions <- predict(model, new_data = feat_test) |> 
    bind_cols(feat_test) |> 
    metrics(truth = sale_price, estimate = .pred)
  
  rmse_value <- predictions |> filter(.metric == "rmse") |> pull(.estimate)
  return(rmse_value)
}

# Test different bootstrap values
boot_values <- seq(2, 100, by = 10)  # Adjust as needed
rmse_results <- sapply(boot_values, evaluate_bagged_trees)

# Create a dataframe for plotting
results_df <- data.frame(Bootstraps = boot_values, RMSE = rmse_results)

# Plot RMSE vs. Number of Bootstraps
ggplot(results_df, aes(x = Bootstraps, y = RMSE)) +
  geom_line() +
  geom_point() +
  labs(title = "Performance of Bagged Trees with Increasing Bootstraps",
       x = "Number of Bootstraps",
       y = "RMSE") +
  theme_minimal()

How do we determine the optimal number of bootstraps for bagged trees? We can use cross-validation to evaluate the model performance with different numbers of bootstraps.

The performance of bagged trees generally improves with more bootstraps. However, the improvement becomes negligible after a certain number of bootstraps. Once the performance stabilizes, we can stop at that number of bootstraps to save computation time.

The usual number of bootstraps is 100, but you can adjust this number based on your dataset size, number of features you have, and computational resources. You might start with a larger number if you have a small dataset or a large number of features.
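
The cross-validation idea can be sketched as follows. This is an illustration under assumptions, not the lab's own code: it uses the built-in `mtcars` data so that it runs on its own, `cv_rmse()` is a hypothetical helper name, and the candidate values are arbitrary.

```r
library(tidymodels)
library(baguette)  # provides the rpart engine for bag_tree()

set.seed(123)
folds <- vfold_cv(mtcars, v = 5)

# Mean cross-validated RMSE for a given number of bootstraps
cv_rmse <- function(n_boots) {
  bag_tree(mode = "regression") |>
    set_engine("rpart", times = n_boots) |>
    fit_resamples(mpg ~ ., resamples = folds, metrics = metric_set(rmse)) |>
    collect_metrics() |>
    pull(mean)
}

boot_grid <- c(5, 10, 25)
cv_results <- tibble(n_boots = boot_grid,
                     rmse = sapply(boot_grid, cv_rmse))
cv_results
```

Because every candidate is scored on held-out folds of the training data, the test set stays untouched for the final evaluation.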

Random Forest

  • mtry: number of variables randomly sampled as candidates at each split

  • trees: number of trees in the forest

  • min_n: minimum number of observations a node must contain for it to be split further

grid_rf <- expand_grid(trees = c(250, 500, 750, 1000), 
                       mtry = c(5, 10, 20, 25), 
                       min_n = c(1, 2, 5, 10))
splits_boot <- bootstraps(data = data_trn, times = 10, strata = "sale_price")

fits_rf <- 
  rand_forest(trees = tune(), mtry = tune(), min_n = tune()) |>
  set_engine("ranger",
             respect.unordered.factors = "order",
             oob.error = FALSE,
             seed = 102030) |> 
  set_mode("regression") |>
  tune_grid(preprocessor = rec,
            resamples = splits_boot,
            grid = grid_rf,
            metrics = metric_set(rmse))
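
Once tuning has finished, the results can be ranked and the best configuration extracted. The sketch below is a scaled-down, self-contained version of the same workflow on the built-in `mtcars` data; the object names and the small grid are assumptions chosen so that it runs quickly.

```r
library(tidymodels)

set.seed(102030)
boots_small <- bootstraps(mtcars, times = 5)
grid_small <- expand_grid(mtry = c(2, 5), min_n = c(2, 5))

fits_small <-
  rand_forest(trees = 100, mtry = tune(), min_n = tune()) |>
  set_engine("ranger", seed = 102030) |>
  set_mode("regression") |>
  tune_grid(preprocessor = recipe(mpg ~ ., data = mtcars),
            resamples = boots_small,
            grid = grid_small,
            metrics = metric_set(rmse))

# Rank configurations by mean RMSE across the bootstrap resamples
show_best(fits_small, metric = "rmse", n = 3)

# Keep the single best configuration as a one-row tibble
best_rf <- select_best(fits_small, metric = "rmse")
best_rf
```

If the best value sits on the edge of the grid, that is usually a sign the grid should be extended in that direction.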

ranger documentation

  • respect.unordered.factors: Unordered factor covariates can be handled in 3 different ways by using respect.unordered.factors: For ’ignore’ all factors are regarded ordered, for ’partition’ all possible 2-partitions are considered for splitting. For ’order’ and 2-class classification the factor levels are ordered by their proportion falling in the second class, for regression by their mean response, as described in Hastie et al. (2009), chapter 9.2.4. For multiclass classification the factor levels are ordered by the first principal component of the weighted covariance matrix of the contingency table (Coppersmith et al. 1999), for survival by the median survival (or the largest available quantile if the median is not available). The use of ’order’ is recommended, as it is computationally fast and can handle an unlimited number of factor levels. Note that the factors are only reordered once and not again in each split.

  • oob.error: If TRUE, the out-of-bag error is calculated. For regression, the out-of-bag error is the mean squared error of the predictions on the out-of-bag samples for the bootstraps. It is different from the mean squared error on the test set. We prefer to use the test set error for model evaluation because it allows us to compare the performance of different models. If FALSE, the out-of-bag error is not calculated, which can save some computation time.

  • seed: Random seed for reproducibility. This is the seed used internally by the ranger engine. The seed set in the code chunk does not appear to carry over to ranger, so we set it here to make the results reproducible.
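
The effect of the `seed` argument and the stored out-of-bag error can be checked by calling ranger directly. A minimal sketch on the built-in `mtcars` data, with `num.threads = 1` set as a precaution so that threading cannot affect the comparison:

```r
library(ranger)

# Same seed, same data, same settings: the two forests should agree
fit_a <- ranger(mpg ~ ., data = mtcars, num.trees = 200,
                seed = 102030, num.threads = 1)
fit_b <- ranger(mpg ~ ., data = mtcars, num.trees = 200,
                seed = 102030, num.threads = 1)
isTRUE(all.equal(fit_a$predictions, fit_b$predictions))

# Out-of-bag error (mean squared error for regression),
# available because oob.error is left at its default of TRUE here
fit_a$prediction.error
```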

Metrics to evaluate decision trees

Classification: accuracy, balanced accuracy, F1 score, precision, recall, auROC, ppv, npv, kappa, etc.

Regression: RMSE, MAE, MAPE, R-squared, etc.
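
As a small illustration of the regression metrics, yardstick's `metric_set()` can compute several at once. The three predictions below are made up so that the values are easy to verify by hand.

```r
library(yardstick)

preds <- data.frame(truth = c(3, 5, 8), estimate = c(2, 5, 10))

# Bundle several regression metrics into one function
reg_metrics <- metric_set(rmse, mae, rsq)
reg_metrics(preds, truth = truth, estimate = estimate)

# The same RMSE and MAE computed by hand
err <- preds$truth - preds$estimate
sqrt(mean(err^2))  # RMSE: square root of the mean squared error
mean(abs(err))     # MAE: mean absolute error
```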

Class imbalance

Class imbalance is a common issue in classification problems where the number of observations in one class is much lower than in the other class. This can lead to biased models that predict the majority class more accurately than the minority class.

Class balance can substantially affect the performance of decision trees and random forests. When the data are imbalanced, we can use techniques like oversampling, undersampling, or SMOTE to balance the training data.
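
In a tidymodels workflow these resampling techniques are available as recipe steps in the themis package. The sketch below shows oversampling on simulated data; the data frame `d` and its columns are made up for illustration.

```r
library(recipes)
library(themis)

set.seed(123)
d <- data.frame(
  y  = factor(c(rep("no", 90), rep("yes", 10))),
  x1 = rnorm(100),
  x2 = rnorm(100)
)
table(d$y)

# Upsample the minority class until it matches the majority class
rec_up <- recipe(y ~ ., data = d) |>
  step_upsample(y, over_ratio = 1)

# bake(new_data = NULL) returns the processed training data
baked <- rec_up |> prep() |> bake(new_data = NULL)
table(baked$y)
```

themis also provides `step_downsample()` and `step_smote()`. By default these steps are skipped when the recipe is baked on new data, so held-out data keep their original class proportions.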

Missingness

  • Missing data in Decision Trees: some decision tree implementations (e.g., rpart) handle missing data natively. When evaluating a candidate split, they use only the observations that have a value for that variable, so missing values do not contribute to the choice of split.

  • Missing data in Random Forest: a random forest is a collection of decision trees, so when a variable with missing data is used for splitting many times, the missingness can bias the model. Some random forest implementations impute missing values with the median of the non-missing values in the training set. Because this is not automatic in every engine, it is safer to impute explicitly in the recipe.
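
Both behaviors can be illustrated on the built-in `mtcars` data. This is a sketch under assumptions: the missing values are introduced artificially, and the imputation is written in base R (in a recipe, `step_impute_median()` plays the same role).

```r
library(rpart)

d <- mtcars
d$hp[c(3, 7, 11)] <- NA   # introduce missing values in one predictor

# rpart keeps rows with a missing predictor and still returns predictions
fit_na <- rpart(mpg ~ ., data = d)
fit_na$frame$n[1]                     # number of observations at the root node
anyNA(predict(fit_na, newdata = d))   # are any predictions missing?

# Explicit median imputation: the median is computed from the training data
hp_median <- median(d$hp, na.rm = TRUE)
d_imputed <- d
d_imputed$hp[is.na(d_imputed$hp)] <- hp_median
```

The stored `hp_median` would also be used to fill missing values in held-out data, so that no information from the test set enters the imputation.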