library(Matrix, exclude = c("expand", "pack", "unpack"))
Unit 09 Lab Agenda
Package Conflicts
- Why do I constantly see conflict errors when resampling, especially with the tidyverse and Matrix packages? The only way I can get around it is to manually load the Matrix package with library() and exclude the functions that conflict with those in the tidyverse.
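To see what the exclusion actually does, here is a minimal sketch that uses only base R and the Matrix package. The object names `excluded` and `attached` are illustrative and not part of the lab code.

```r
# Attach Matrix but keep three of its exports off the search path, so that
# same-named tidyverse functions (for example tidyr::expand) are not masked.
library(Matrix, exclude = c("expand", "pack", "unpack"))

excluded <- c("expand", "pack", "unpack")
attached <- ls("package:Matrix")

# None of the excluded names are visible in the attached package environment
print(intersect(excluded, attached))

# They remain reachable through the namespace, e.g. Matrix::pack(), if needed
print(exists("pack", envir = asNamespace("Matrix")))

# conflicts() lists names defined in more than one attached environment,
# which helps decide what to exclude in the first place
print(head(conflicts()))
```

Calling a function with an explicit prefix (`tidyr::expand()` or `Matrix::pack()`) is another way to sidestep masking without changing what is attached.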
Methods to speed up EDA (skimr) when dealing with large-scale data
library(skimr)
ames <- ames |> janitor::clean_names()
ames |> skim()
Name | ames |
Number of rows | 2930 |
Number of columns | 74 |
_______________________ | |
Column type frequency: | |
factor | 40 |
numeric | 34 |
________________________ | |
Group variables | None |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
ms_sub_class | 0 | 1 | FALSE | 16 | One: 1079, Two: 575, One: 287, One: 192 |
ms_zoning | 0 | 1 | FALSE | 7 | Res: 2273, Res: 462, Flo: 139, Res: 27 |
street | 0 | 1 | FALSE | 2 | Pav: 2918, Grv: 12 |
alley | 0 | 1 | FALSE | 3 | No_: 2732, Gra: 120, Pav: 78 |
lot_shape | 0 | 1 | FALSE | 4 | Reg: 1859, Sli: 979, Mod: 76, Irr: 16 |
land_contour | 0 | 1 | FALSE | 4 | Lvl: 2633, HLS: 120, Bnk: 117, Low: 60 |
utilities | 0 | 1 | FALSE | 3 | All: 2927, NoS: 2, NoS: 1 |
lot_config | 0 | 1 | FALSE | 5 | Ins: 2140, Cor: 511, Cul: 180, FR2: 85 |
land_slope | 0 | 1 | FALSE | 3 | Gtl: 2789, Mod: 125, Sev: 16 |
neighborhood | 0 | 1 | FALSE | 28 | Nor: 443, Col: 267, Old: 239, Edw: 194 |
condition_1 | 0 | 1 | FALSE | 9 | Nor: 2522, Fee: 164, Art: 92, RRA: 50 |
condition_2 | 0 | 1 | FALSE | 8 | Nor: 2900, Fee: 13, Art: 5, Pos: 4 |
bldg_type | 0 | 1 | FALSE | 5 | One: 2425, Twn: 233, Dup: 109, Twn: 101 |
house_style | 0 | 1 | FALSE | 8 | One: 1481, Two: 873, One: 314, SLv: 128 |
overall_cond | 0 | 1 | FALSE | 9 | Ave: 1654, Abo: 533, Goo: 390, Ver: 144 |
roof_style | 0 | 1 | FALSE | 6 | Gab: 2321, Hip: 551, Gam: 22, Fla: 20 |
roof_matl | 0 | 1 | FALSE | 8 | Com: 2887, Tar: 23, WdS: 9, WdS: 7 |
exterior_1st | 0 | 1 | FALSE | 16 | Vin: 1026, Met: 450, HdB: 442, Wd : 420 |
exterior_2nd | 0 | 1 | FALSE | 17 | Vin: 1015, Met: 447, HdB: 406, Wd : 397 |
mas_vnr_type | 0 | 1 | FALSE | 5 | Non: 1775, Brk: 880, Sto: 249, Brk: 25 |
exter_cond | 0 | 1 | FALSE | 5 | Typ: 2549, Goo: 299, Fai: 67, Exc: 12 |
foundation | 0 | 1 | FALSE | 6 | PCo: 1310, CBl: 1244, Brk: 311, Sla: 49 |
bsmt_cond | 0 | 1 | FALSE | 6 | Typ: 2616, Goo: 122, Fai: 104, No_: 80 |
bsmt_exposure | 0 | 1 | FALSE | 5 | No: 1906, Av: 418, Gd: 284, Mn: 239 |
bsmt_fin_type_1 | 0 | 1 | FALSE | 7 | GLQ: 859, Unf: 851, ALQ: 429, Rec: 288 |
bsmt_fin_type_2 | 0 | 1 | FALSE | 7 | Unf: 2499, Rec: 106, LwQ: 89, No_: 81 |
heating | 0 | 1 | FALSE | 6 | Gas: 2885, Gas: 27, Gra: 9, Wal: 6 |
heating_qc | 0 | 1 | FALSE | 5 | Exc: 1495, Typ: 864, Goo: 476, Fai: 92 |
central_air | 0 | 1 | FALSE | 2 | Y: 2734, N: 196 |
electrical | 0 | 1 | FALSE | 6 | SBr: 2682, Fus: 188, Fus: 50, Fus: 8 |
functional | 0 | 1 | FALSE | 8 | Typ: 2728, Min: 70, Min: 65, Mod: 35 |
garage_type | 0 | 1 | FALSE | 7 | Att: 1731, Det: 782, Bui: 186, No_: 157 |
garage_finish | 0 | 1 | FALSE | 4 | Unf: 1231, RFn: 812, Fin: 728, No_: 159 |
garage_cond | 0 | 1 | FALSE | 6 | Typ: 2665, No_: 159, Fai: 74, Goo: 15 |
paved_drive | 0 | 1 | FALSE | 3 | Pav: 2652, Dir: 216, Par: 62 |
pool_qc | 0 | 1 | FALSE | 5 | No_: 2917, Exc: 4, Goo: 4, Typ: 3 |
fence | 0 | 1 | FALSE | 5 | No_: 2358, Min: 330, Goo: 118, Goo: 112 |
misc_feature | 0 | 1 | FALSE | 6 | Non: 2824, She: 95, Gar: 5, Oth: 4 |
sale_type | 0 | 1 | FALSE | 10 | WD : 2536, New: 239, COD: 87, Con: 26 |
sale_condition | 0 | 1 | FALSE | 6 | Nor: 2413, Par: 245, Abn: 190, Fam: 46 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
lot_frontage | 0 | 1 | 57.65 | 33.50 | 0.00 | 43.00 | 63.00 | 78.00 | 313.00 | ▇▇▁▁▁ |
lot_area | 0 | 1 | 10147.92 | 7880.02 | 1300.00 | 7440.25 | 9436.50 | 11555.25 | 215245.00 | ▇▁▁▁▁ |
year_built | 0 | 1 | 1971.36 | 30.25 | 1872.00 | 1954.00 | 1973.00 | 2001.00 | 2010.00 | ▁▂▃▆▇ |
year_remod_add | 0 | 1 | 1984.27 | 20.86 | 1950.00 | 1965.00 | 1993.00 | 2004.00 | 2010.00 | ▅▂▂▃▇ |
mas_vnr_area | 0 | 1 | 101.10 | 178.63 | 0.00 | 0.00 | 0.00 | 162.75 | 1600.00 | ▇▁▁▁▁ |
bsmt_fin_sf_1 | 0 | 1 | 4.18 | 2.23 | 0.00 | 3.00 | 3.00 | 7.00 | 7.00 | ▃▂▇▁▇ |
bsmt_fin_sf_2 | 0 | 1 | 49.71 | 169.14 | 0.00 | 0.00 | 0.00 | 0.00 | 1526.00 | ▇▁▁▁▁ |
bsmt_unf_sf | 0 | 1 | 559.07 | 439.54 | 0.00 | 219.00 | 465.50 | 801.75 | 2336.00 | ▇▅▂▁▁ |
total_bsmt_sf | 0 | 1 | 1051.26 | 440.97 | 0.00 | 793.00 | 990.00 | 1301.50 | 6110.00 | ▇▃▁▁▁ |
first_flr_sf | 0 | 1 | 1159.56 | 391.89 | 334.00 | 876.25 | 1084.00 | 1384.00 | 5095.00 | ▇▃▁▁▁ |
second_flr_sf | 0 | 1 | 335.46 | 428.40 | 0.00 | 0.00 | 0.00 | 703.75 | 2065.00 | ▇▃▂▁▁ |
gr_liv_area | 0 | 1 | 1499.69 | 505.51 | 334.00 | 1126.00 | 1442.00 | 1742.75 | 5642.00 | ▇▇▁▁▁ |
bsmt_full_bath | 0 | 1 | 0.43 | 0.52 | 0.00 | 0.00 | 0.00 | 1.00 | 3.00 | ▇▆▁▁▁ |
bsmt_half_bath | 0 | 1 | 0.06 | 0.25 | 0.00 | 0.00 | 0.00 | 0.00 | 2.00 | ▇▁▁▁▁ |
full_bath | 0 | 1 | 1.57 | 0.55 | 0.00 | 1.00 | 2.00 | 2.00 | 4.00 | ▁▇▇▁▁ |
half_bath | 0 | 1 | 0.38 | 0.50 | 0.00 | 0.00 | 0.00 | 1.00 | 2.00 | ▇▁▅▁▁ |
bedroom_abv_gr | 0 | 1 | 2.85 | 0.83 | 0.00 | 2.00 | 3.00 | 3.00 | 8.00 | ▁▇▂▁▁ |
kitchen_abv_gr | 0 | 1 | 1.04 | 0.21 | 0.00 | 1.00 | 1.00 | 1.00 | 3.00 | ▁▇▁▁▁ |
tot_rms_abv_grd | 0 | 1 | 6.44 | 1.57 | 2.00 | 5.00 | 6.00 | 7.00 | 15.00 | ▁▇▂▁▁ |
fireplaces | 0 | 1 | 0.60 | 0.65 | 0.00 | 0.00 | 1.00 | 1.00 | 4.00 | ▇▇▁▁▁ |
garage_cars | 0 | 1 | 1.77 | 0.76 | 0.00 | 1.00 | 2.00 | 2.00 | 5.00 | ▅▇▂▁▁ |
garage_area | 0 | 1 | 472.66 | 215.19 | 0.00 | 320.00 | 480.00 | 576.00 | 1488.00 | ▃▇▃▁▁ |
wood_deck_sf | 0 | 1 | 93.75 | 126.36 | 0.00 | 0.00 | 0.00 | 168.00 | 1424.00 | ▇▁▁▁▁ |
open_porch_sf | 0 | 1 | 47.53 | 67.48 | 0.00 | 0.00 | 27.00 | 70.00 | 742.00 | ▇▁▁▁▁ |
enclosed_porch | 0 | 1 | 23.01 | 64.14 | 0.00 | 0.00 | 0.00 | 0.00 | 1012.00 | ▇▁▁▁▁ |
three_season_porch | 0 | 1 | 2.59 | 25.14 | 0.00 | 0.00 | 0.00 | 0.00 | 508.00 | ▇▁▁▁▁ |
screen_porch | 0 | 1 | 16.00 | 56.09 | 0.00 | 0.00 | 0.00 | 0.00 | 576.00 | ▇▁▁▁▁ |
pool_area | 0 | 1 | 2.24 | 35.60 | 0.00 | 0.00 | 0.00 | 0.00 | 800.00 | ▇▁▁▁▁ |
misc_val | 0 | 1 | 50.64 | 566.34 | 0.00 | 0.00 | 0.00 | 0.00 | 17000.00 | ▇▁▁▁▁ |
mo_sold | 0 | 1 | 6.22 | 2.71 | 1.00 | 4.00 | 6.00 | 8.00 | 12.00 | ▅▆▇▃▃ |
year_sold | 0 | 1 | 2007.79 | 1.32 | 2006.00 | 2007.00 | 2008.00 | 2009.00 | 2010.00 | ▇▇▇▇▃ |
sale_price | 0 | 1 | 180796.06 | 79886.69 | 12789.00 | 129500.00 | 160000.00 | 213500.00 | 755000.00 | ▇▇▁▁▁ |
longitude | 0 | 1 | -93.64 | 0.03 | -93.69 | -93.66 | -93.64 | -93.62 | -93.58 | ▅▅▇▆▁ |
latitude | 0 | 1 | 42.03 | 0.02 | 41.99 | 42.02 | 42.03 | 42.05 | 42.06 | ▂▂▇▇▇ |
# select a subset of summary statistics
my_skim <- skim_with(numeric = sfl(hist = NULL, p0 = NULL, p25 = NULL,
                                   p75 = NULL, p100 = NULL))
ames |> my_skim()
Name | ames |
Number of rows | 2930 |
Number of columns | 74 |
_______________________ | |
Column type frequency: | |
factor | 40 |
numeric | 34 |
________________________ | |
Group variables | None |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
ms_sub_class | 0 | 1 | FALSE | 16 | One: 1079, Two: 575, One: 287, One: 192 |
ms_zoning | 0 | 1 | FALSE | 7 | Res: 2273, Res: 462, Flo: 139, Res: 27 |
street | 0 | 1 | FALSE | 2 | Pav: 2918, Grv: 12 |
alley | 0 | 1 | FALSE | 3 | No_: 2732, Gra: 120, Pav: 78 |
lot_shape | 0 | 1 | FALSE | 4 | Reg: 1859, Sli: 979, Mod: 76, Irr: 16 |
land_contour | 0 | 1 | FALSE | 4 | Lvl: 2633, HLS: 120, Bnk: 117, Low: 60 |
utilities | 0 | 1 | FALSE | 3 | All: 2927, NoS: 2, NoS: 1 |
lot_config | 0 | 1 | FALSE | 5 | Ins: 2140, Cor: 511, Cul: 180, FR2: 85 |
land_slope | 0 | 1 | FALSE | 3 | Gtl: 2789, Mod: 125, Sev: 16 |
neighborhood | 0 | 1 | FALSE | 28 | Nor: 443, Col: 267, Old: 239, Edw: 194 |
condition_1 | 0 | 1 | FALSE | 9 | Nor: 2522, Fee: 164, Art: 92, RRA: 50 |
condition_2 | 0 | 1 | FALSE | 8 | Nor: 2900, Fee: 13, Art: 5, Pos: 4 |
bldg_type | 0 | 1 | FALSE | 5 | One: 2425, Twn: 233, Dup: 109, Twn: 101 |
house_style | 0 | 1 | FALSE | 8 | One: 1481, Two: 873, One: 314, SLv: 128 |
overall_cond | 0 | 1 | FALSE | 9 | Ave: 1654, Abo: 533, Goo: 390, Ver: 144 |
roof_style | 0 | 1 | FALSE | 6 | Gab: 2321, Hip: 551, Gam: 22, Fla: 20 |
roof_matl | 0 | 1 | FALSE | 8 | Com: 2887, Tar: 23, WdS: 9, WdS: 7 |
exterior_1st | 0 | 1 | FALSE | 16 | Vin: 1026, Met: 450, HdB: 442, Wd : 420 |
exterior_2nd | 0 | 1 | FALSE | 17 | Vin: 1015, Met: 447, HdB: 406, Wd : 397 |
mas_vnr_type | 0 | 1 | FALSE | 5 | Non: 1775, Brk: 880, Sto: 249, Brk: 25 |
exter_cond | 0 | 1 | FALSE | 5 | Typ: 2549, Goo: 299, Fai: 67, Exc: 12 |
foundation | 0 | 1 | FALSE | 6 | PCo: 1310, CBl: 1244, Brk: 311, Sla: 49 |
bsmt_cond | 0 | 1 | FALSE | 6 | Typ: 2616, Goo: 122, Fai: 104, No_: 80 |
bsmt_exposure | 0 | 1 | FALSE | 5 | No: 1906, Av: 418, Gd: 284, Mn: 239 |
bsmt_fin_type_1 | 0 | 1 | FALSE | 7 | GLQ: 859, Unf: 851, ALQ: 429, Rec: 288 |
bsmt_fin_type_2 | 0 | 1 | FALSE | 7 | Unf: 2499, Rec: 106, LwQ: 89, No_: 81 |
heating | 0 | 1 | FALSE | 6 | Gas: 2885, Gas: 27, Gra: 9, Wal: 6 |
heating_qc | 0 | 1 | FALSE | 5 | Exc: 1495, Typ: 864, Goo: 476, Fai: 92 |
central_air | 0 | 1 | FALSE | 2 | Y: 2734, N: 196 |
electrical | 0 | 1 | FALSE | 6 | SBr: 2682, Fus: 188, Fus: 50, Fus: 8 |
functional | 0 | 1 | FALSE | 8 | Typ: 2728, Min: 70, Min: 65, Mod: 35 |
garage_type | 0 | 1 | FALSE | 7 | Att: 1731, Det: 782, Bui: 186, No_: 157 |
garage_finish | 0 | 1 | FALSE | 4 | Unf: 1231, RFn: 812, Fin: 728, No_: 159 |
garage_cond | 0 | 1 | FALSE | 6 | Typ: 2665, No_: 159, Fai: 74, Goo: 15 |
paved_drive | 0 | 1 | FALSE | 3 | Pav: 2652, Dir: 216, Par: 62 |
pool_qc | 0 | 1 | FALSE | 5 | No_: 2917, Exc: 4, Goo: 4, Typ: 3 |
fence | 0 | 1 | FALSE | 5 | No_: 2358, Min: 330, Goo: 118, Goo: 112 |
misc_feature | 0 | 1 | FALSE | 6 | Non: 2824, She: 95, Gar: 5, Oth: 4 |
sale_type | 0 | 1 | FALSE | 10 | WD : 2536, New: 239, COD: 87, Con: 26 |
sale_condition | 0 | 1 | FALSE | 6 | Nor: 2413, Par: 245, Abn: 190, Fam: 46 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p50 |
---|---|---|---|---|---|
lot_frontage | 0 | 1 | 57.65 | 33.50 | 63.00 |
lot_area | 0 | 1 | 10147.92 | 7880.02 | 9436.50 |
year_built | 0 | 1 | 1971.36 | 30.25 | 1973.00 |
year_remod_add | 0 | 1 | 1984.27 | 20.86 | 1993.00 |
mas_vnr_area | 0 | 1 | 101.10 | 178.63 | 0.00 |
bsmt_fin_sf_1 | 0 | 1 | 4.18 | 2.23 | 3.00 |
bsmt_fin_sf_2 | 0 | 1 | 49.71 | 169.14 | 0.00 |
bsmt_unf_sf | 0 | 1 | 559.07 | 439.54 | 465.50 |
total_bsmt_sf | 0 | 1 | 1051.26 | 440.97 | 990.00 |
first_flr_sf | 0 | 1 | 1159.56 | 391.89 | 1084.00 |
second_flr_sf | 0 | 1 | 335.46 | 428.40 | 0.00 |
gr_liv_area | 0 | 1 | 1499.69 | 505.51 | 1442.00 |
bsmt_full_bath | 0 | 1 | 0.43 | 0.52 | 0.00 |
bsmt_half_bath | 0 | 1 | 0.06 | 0.25 | 0.00 |
full_bath | 0 | 1 | 1.57 | 0.55 | 2.00 |
half_bath | 0 | 1 | 0.38 | 0.50 | 0.00 |
bedroom_abv_gr | 0 | 1 | 2.85 | 0.83 | 3.00 |
kitchen_abv_gr | 0 | 1 | 1.04 | 0.21 | 1.00 |
tot_rms_abv_grd | 0 | 1 | 6.44 | 1.57 | 6.00 |
fireplaces | 0 | 1 | 0.60 | 0.65 | 1.00 |
garage_cars | 0 | 1 | 1.77 | 0.76 | 2.00 |
garage_area | 0 | 1 | 472.66 | 215.19 | 480.00 |
wood_deck_sf | 0 | 1 | 93.75 | 126.36 | 0.00 |
open_porch_sf | 0 | 1 | 47.53 | 67.48 | 27.00 |
enclosed_porch | 0 | 1 | 23.01 | 64.14 | 0.00 |
three_season_porch | 0 | 1 | 2.59 | 25.14 | 0.00 |
screen_porch | 0 | 1 | 16.00 | 56.09 | 0.00 |
pool_area | 0 | 1 | 2.24 | 35.60 | 0.00 |
misc_val | 0 | 1 | 50.64 | 566.34 | 0.00 |
mo_sold | 0 | 1 | 6.22 | 2.71 | 6.00 |
year_sold | 0 | 1 | 2007.79 | 1.32 | 2008.00 |
sale_price | 0 | 1 | 180796.06 | 79886.69 | 160000.00 |
longitude | 0 | 1 | -93.64 | 0.03 | -93.64 |
latitude | 0 | 1 | 42.03 | 0.02 | 42.03 |
# select a subset of columns
ames |> summarize(across(where(is.numeric), ~median(.x, na.rm = TRUE)))
# A tibble: 1 × 34
lot_frontage lot_area year_built year_remod_add mas_vnr_area bsmt_fin_sf_1
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 63 9436. 1973 1993 0 3
bsmt_fin_sf_2 bsmt_unf_sf total_bsmt_sf first_flr_sf second_flr_sf gr_liv_area
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0 466. 990 1084 0 1442
bsmt_full_bath bsmt_half_bath full_bath half_bath bedroom_abv_gr
<dbl> <dbl> <dbl> <dbl> <dbl>
1 0 0 2 0 3
kitchen_abv_gr tot_rms_abv_grd fireplaces garage_cars garage_area wood_deck_sf
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 6 1 2 480 0
open_porch_sf enclosed_porch three_season_porch screen_porch pool_area
<dbl> <dbl> <dbl> <dbl> <dbl>
1 27 0 0 0 0
misc_val mo_sold year_sold sale_price longitude latitude
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0 6 2008 160000 -93.6 42.0
Graph Interpretation
Example decision tree graphs with different tree depths (tree_depth)
set.seed(123)
# Split once so that the training and test sets cannot overlap
splits <- ames |>
  initial_split(prop = 3/4, strata = "sale_price", breaks = 4)
data_trn <- splits |>
  analysis()
data_test <- splits |>
  assessment()
rec <- recipe(sale_price ~ ., data = data_trn)
rec_prep <- rec |>
  prep(data_trn)
feat_trn <- rec_prep |> bake(NULL)
feat_test <- rec_prep |> bake(data_test)
feat_trn |> skim_some()
Name | feat_trn |
Number of rows | 2197 |
Number of columns | 74 |
_______________________ | |
Column type frequency: | |
factor | 40 |
numeric | 34 |
________________________ | |
Group variables | None |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
ms_sub_class | 0 | 1 | FALSE | 16 | One: 796, Two: 436, One: 217, One: 154 |
ms_zoning | 0 | 1 | FALSE | 7 | Res: 1695, Res: 350, Flo: 109, C_a: 21 |
street | 0 | 1 | FALSE | 2 | Pav: 2186, Grv: 11 |
alley | 0 | 1 | FALSE | 3 | No_: 2043, Gra: 90, Pav: 64 |
lot_shape | 0 | 1 | FALSE | 4 | Reg: 1377, Sli: 755, Mod: 56, Irr: 9 |
land_contour | 0 | 1 | FALSE | 4 | Lvl: 1961, Bnk: 98, HLS: 94, Low: 44 |
utilities | 0 | 1 | FALSE | 3 | All: 2194, NoS: 2, NoS: 1 |
lot_config | 0 | 1 | FALSE | 5 | Ins: 1609, Cor: 380, Cul: 129, FR2: 67 |
land_slope | 0 | 1 | FALSE | 3 | Gtl: 2087, Mod: 100, Sev: 10 |
neighborhood | 0 | 1 | FALSE | 28 | Nor: 326, Col: 204, Old: 175, Edw: 146 |
condition_1 | 0 | 1 | FALSE | 9 | Nor: 1881, Fee: 127, Art: 71, RRA: 37 |
condition_2 | 0 | 1 | FALSE | 7 | Nor: 2175, Fee: 10, Art: 4, Pos: 3 |
bldg_type | 0 | 1 | FALSE | 5 | One: 1802, Twn: 187, Dup: 84, Twn: 77 |
house_style | 0 | 1 | FALSE | 8 | One: 1109, Two: 661, One: 231, SLv: 98 |
overall_cond | 0 | 1 | FALSE | 9 | Ave: 1257, Abo: 386, Goo: 288, Ver: 111 |
roof_style | 0 | 1 | FALSE | 6 | Gab: 1745, Hip: 413, Gam: 16, Fla: 12 |
roof_matl | 0 | 1 | FALSE | 6 | Com: 2168, Tar: 16, WdS: 6, WdS: 5 |
exterior_1st | 0 | 1 | FALSE | 15 | Vin: 762, HdB: 341, Met: 339, Wd : 307 |
exterior_2nd | 0 | 1 | FALSE | 16 | Vin: 758, Met: 333, HdB: 305, Wd : 290 |
mas_vnr_type | 0 | 1 | FALSE | 5 | Non: 1330, Brk: 659, Sto: 190, Brk: 17 |
exter_cond | 0 | 1 | FALSE | 5 | Typ: 1919, Goo: 219, Fai: 51, Exc: 7 |
foundation | 0 | 1 | FALSE | 6 | PCo: 991, CBl: 928, Brk: 226, Sla: 41 |
bsmt_cond | 0 | 1 | FALSE | 6 | Typ: 1961, Fai: 84, Goo: 81, No_: 64 |
bsmt_exposure | 0 | 1 | FALSE | 5 | No: 1439, Av: 312, Gd: 203, Mn: 176 |
bsmt_fin_type_1 | 0 | 1 | FALSE | 7 | GLQ: 654, Unf: 651, ALQ: 315, Rec: 202 |
bsmt_fin_type_2 | 0 | 1 | FALSE | 7 | Unf: 1880, Rec: 75, No_: 64, LwQ: 63 |
heating | 0 | 1 | FALSE | 6 | Gas: 2158, Gas: 23, Gra: 8, Wal: 6 |
heating_qc | 0 | 1 | FALSE | 5 | Exc: 1130, Typ: 630, Goo: 357, Fai: 77 |
central_air | 0 | 1 | FALSE | 2 | Y: 2047, N: 150 |
electrical | 0 | 1 | FALSE | 5 | SBr: 2010, Fus: 141, Fus: 40, Fus: 5 |
functional | 0 | 1 | FALSE | 8 | Typ: 2039, Min: 54, Min: 54, Mod: 24 |
garage_type | 0 | 1 | FALSE | 7 | Att: 1288, Det: 591, Bui: 138, No_: 122 |
garage_finish | 0 | 1 | FALSE | 4 | Unf: 920, RFn: 616, Fin: 537, No_: 124 |
garage_cond | 0 | 1 | FALSE | 6 | Typ: 1996, No_: 124, Fai: 52, Poo: 12 |
paved_drive | 0 | 1 | FALSE | 3 | Pav: 1985, Dir: 166, Par: 46 |
pool_qc | 0 | 1 | FALSE | 5 | No_: 2186, Goo: 4, Typ: 3, Exc: 2 |
fence | 0 | 1 | FALSE | 5 | No_: 1780, Min: 236, Goo: 87, Goo: 86 |
misc_feature | 0 | 1 | FALSE | 6 | Non: 2123, She: 65, Oth: 4, Gar: 3 |
sale_type | 0 | 1 | FALSE | 10 | WD : 1906, New: 178, COD: 64, Con: 18 |
sale_condition | 0 | 1 | FALSE | 6 | Nor: 1800, Par: 183, Abn: 148, Fam: 38 |
Variable type: numeric
skim_variable | n_missing | complete_rate | p0 | p100 |
---|---|---|---|---|
lot_frontage | 0 | 1 | 0.00 | 313.00 |
lot_area | 0 | 1 | 1300.00 | 215245.00 |
year_built | 0 | 1 | 1872.00 | 2010.00 |
year_remod_add | 0 | 1 | 1950.00 | 2010.00 |
mas_vnr_area | 0 | 1 | 0.00 | 1600.00 |
bsmt_fin_sf_1 | 0 | 1 | 0.00 | 7.00 |
bsmt_fin_sf_2 | 0 | 1 | 0.00 | 1393.00 |
bsmt_unf_sf | 0 | 1 | 0.00 | 2336.00 |
total_bsmt_sf | 0 | 1 | 0.00 | 6110.00 |
first_flr_sf | 0 | 1 | 334.00 | 5095.00 |
second_flr_sf | 0 | 1 | 0.00 | 1872.00 |
gr_liv_area | 0 | 1 | 334.00 | 5642.00 |
bsmt_full_bath | 0 | 1 | 0.00 | 3.00 |
bsmt_half_bath | 0 | 1 | 0.00 | 2.00 |
full_bath | 0 | 1 | 0.00 | 4.00 |
half_bath | 0 | 1 | 0.00 | 2.00 |
bedroom_abv_gr | 0 | 1 | 0.00 | 8.00 |
kitchen_abv_gr | 0 | 1 | 0.00 | 3.00 |
tot_rms_abv_grd | 0 | 1 | 2.00 | 15.00 |
fireplaces | 0 | 1 | 0.00 | 4.00 |
garage_cars | 0 | 1 | 0.00 | 4.00 |
garage_area | 0 | 1 | 0.00 | 1488.00 |
wood_deck_sf | 0 | 1 | 0.00 | 870.00 |
open_porch_sf | 0 | 1 | 0.00 | 742.00 |
enclosed_porch | 0 | 1 | 0.00 | 1012.00 |
three_season_porch | 0 | 1 | 0.00 | 508.00 |
screen_porch | 0 | 1 | 0.00 | 576.00 |
pool_area | 0 | 1 | 0.00 | 800.00 |
misc_val | 0 | 1 | 0.00 | 17000.00 |
mo_sold | 0 | 1 | 1.00 | 12.00 |
year_sold | 0 | 1 | 2006.00 | 2010.00 |
longitude | 0 | 1 | -93.69 | -93.58 |
latitude | 0 | 1 | 41.99 | 42.06 |
sale_price | 0 | 1 | 12789.00 | 755000.00 |
# Decision Tree
fit_tree_1 <-
  decision_tree(tree_depth = 1, min_n = 2, cost_complexity = 0) |>
  set_engine("rpart", model = TRUE) |>
  set_mode("regression") |>
  fit(sale_price ~ ., data = feat_trn)
fit_tree_1$fit |> rpart.plot::rpart.plot()
fit_tree_2 <-
  decision_tree(tree_depth = 2, min_n = 2, cost_complexity = 0) |>
  set_engine("rpart", model = TRUE) |>
  set_mode("regression") |>
  fit(sale_price ~ ., data = feat_trn)
fit_tree_2$fit |> rpart.plot::rpart.plot()
fit_tree_3 <-
  decision_tree(tree_depth = 3, min_n = 2, cost_complexity = 0) |>
  set_engine("rpart", model = TRUE) |>
  set_mode("regression") |>
  fit(sale_price ~ ., data = feat_trn)
fit_tree_3$fit |> rpart.plot::rpart.plot()
A decision tree is only somewhat interpretable, and only when its structure is simple. As the tree depth increases, the tree becomes more complex and harder to interpret.
Bagged trees and random forests are inherently not interpretable. However, we will introduce a few model-agnostic approaches in later chapters to interpret these models.
Bagged Trees
tree_depth
: maximum depth of the tree
min_n
: minimum number of observations a node must contain to be split further
cost_complexity
: complexity parameter for the tree. The larger the value, the simpler the tree.
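As a rough illustration of how these hyperparameters shape a tree, the sketch below fits rpart trees directly on the built-in mtcars data. The data set and the helper `n_leaves()` are assumptions made for illustration, not the lab's Ames setup. With the rpart engine, tree_depth, min_n, and cost_complexity correspond to rpart's maxdepth, minsplit, and cp.

```r
library(rpart)

# Hypothetical helper: count the terminal nodes (leaves) of an rpart fit
n_leaves <- function(fit) sum(fit$frame$var == "<leaf>")

# A depth-1 tree can make only one split
fit_shallow <- rpart(mpg ~ ., data = mtcars, method = "anova",
                     control = rpart.control(maxdepth = 1, minsplit = 2, cp = 0))

# A depth-3 tree can make up to 7 splits (at most 8 leaves)
fit_deep <- rpart(mpg ~ ., data = mtcars, method = "anova",
                  control = rpart.control(maxdepth = 3, minsplit = 2, cp = 0))

# A larger complexity penalty removes splits that improve the fit too little
fit_penalized <- rpart(mpg ~ ., data = mtcars, method = "anova",
                       control = rpart.control(maxdepth = 3, minsplit = 2, cp = 0.5))

print(c(shallow = n_leaves(fit_shallow),
        deep = n_leaves(fit_deep),
        penalized = n_leaves(fit_penalized)))
```

The leaf counts make the trade-off concrete: deeper trees and smaller penalties give more leaves, which fit the training data more closely but are harder to read.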
library(baguette) # provides bag_tree()
evaluate_bagged_trees <- function(n_boots) {
  model <- bag_tree(mode = "regression") |>
    set_engine("rpart", times = n_boots) |> # Adjust number of bootstraps
    fit(sale_price ~ ., data = feat_trn)
  predictions <- predict(model, new_data = feat_test) |>
    bind_cols(feat_test) |>
    metrics(truth = sale_price, estimate = .pred)
  rmse_value <- predictions |> filter(.metric == "rmse") |> pull(.estimate)
  return(rmse_value)
}
# Test different bootstrap values
boot_values <- seq(2, 100, by = 10) # Adjust as needed
rmse_results <- sapply(boot_values, evaluate_bagged_trees)
# Create a dataframe for plotting
results_df <- data.frame(Bootstraps = boot_values, RMSE = rmse_results)
# Plot RMSE vs. Number of Bootstraps
ggplot(results_df, aes(x = Bootstraps, y = RMSE)) +
geom_line() +
geom_point() +
labs(title = "Performance of Bagged Trees with Increasing Bootstraps",
x = "Number of Bootstraps",
y = "RMSE") +
theme_minimal()
How do we determine the optimal number of bootstraps for bagged trees? We can use cross-validation to evaluate the model performance with different numbers of bootstraps.
The performance of bagged trees generally improves with more bootstraps. However, the improvement may not be significant after a certain number of bootstraps. When the performance stabilizes, we can choose that optimal number of bootstraps to save computation time.
The usual number of bootstraps is 100, but you can adjust this number based on your dataset size, number of features you have, and computational resources. You might start with a larger number if you have a small dataset or a large number of features.
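The mechanics behind this can be sketched by hand. The example below is a simplified stand-in for bag_tree(), assuming the built-in mtcars data and a hypothetical helper `bagged_rmse()`: it fits one unpruned rpart tree per bootstrap resample, averages the predictions, and reports held-out RMSE for several ensemble sizes.

```r
library(rpart)

set.seed(42)
test_idx <- sample(nrow(mtcars), 10)
trn <- mtcars[-test_idx, ]
tst <- mtcars[test_idx, ]

# Hypothetical helper: bag n_boots trees and return RMSE on the held-out rows
bagged_rmse <- function(n_boots) {
  preds <- vapply(seq_len(n_boots), function(b) {
    # Bootstrap resample: draw rows with replacement from the training data
    boot <- trn[sample(nrow(trn), replace = TRUE), ]
    fit <- rpart(mpg ~ ., data = boot,
                 control = rpart.control(minsplit = 2, cp = 0))
    unname(predict(fit, newdata = tst))
  }, FUN.VALUE = numeric(nrow(tst)))
  # The bagged prediction is the average over all trees
  bagged <- rowMeans(preds)
  sqrt(mean((tst$mpg - bagged)^2))
}

boot_values <- c(1, 5, 25, 100)
rmse_results <- sapply(boot_values, bagged_rmse)
print(data.frame(bootstraps = boot_values, rmse = rmse_results))
```

With a data set this small the curve is noisy, so treat the output as a demonstration of the procedure rather than evidence for a particular number of bootstraps.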
Random Forest
mtry
: number of variables randomly sampled as candidates at each split
trees
: number of trees in the forest
min_n
: minimum number of observations a node must contain to be split further
grid_rf <- expand_grid(trees = c(250, 500, 750, 1000),
                       mtry = c(5, 10, 20, 25),
                       min_n = c(1, 2, 5, 10))
splits_boot <- bootstraps(data = data_trn, times = 10, strata = "sale_price")
fits_rf <-
  rand_forest(trees = tune(), mtry = tune(), min_n = tune()) |>
  set_engine("ranger",
             respect.unordered.factors = "order",
             oob.error = FALSE,
             seed = 102030) |>
  set_mode("regression") |>
  tune_grid(preprocessor = rec,
            resamples = splits_boot,
            grid = grid_rf,
            metrics = metric_set(rmse))
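It is worth checking the size of this search before running it. The following is a quick base R sketch of the arithmetic, with expand.grid() standing in for expand_grid(). The names `grid`, `n_resamples`, `n_fits`, and `total_trees` are illustrative.

```r
# Same candidate values as the tuning grid above
grid <- expand.grid(trees = c(250, 500, 750, 1000),
                    mtry = c(5, 10, 20, 25),
                    min_n = c(1, 2, 5, 10))

n_resamples <- 10  # number of bootstrap resamples

# One forest is fit for every configuration on every resample
n_fits <- nrow(grid) * n_resamples

# Total number of individual trees grown across the whole search
total_trees <- sum(grid$trees) * n_resamples

print(c(configurations = nrow(grid), fits = n_fits, trees = total_trees))
```

Counts like these explain why grid searches over random forests are slow, and why small savings per fit (such as skipping the out-of-bag error) add up.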
respect.unordered.factors
: Unordered factor covariates can be handled in 3 different ways by using respect.unordered.factors: For 'ignore' all factors are regarded as ordered, for 'partition' all possible 2-partitions are considered for splitting. For 'order' and 2-class classification the factor levels are ordered by their proportion falling in the second class, for regression by their mean response, as described in Hastie et al. (2009), chapter 9.2.4. For multiclass classification the factor levels are ordered by the first principal component of the weighted covariance matrix of the contingency table (Coppersmith et al. 1999), for survival by the median survival (or the largest available quantile if the median is not available). The use of 'order' is recommended, as it is computationally fast and can handle an unlimited number of factor levels. Note that the factors are only reordered once and not again in each split.
oob.error
: If TRUE, the out-of-bag error is calculated. The out-of-bag error is the mean squared error of the predictions on the out-of-bag samples, that is, the observations left out of each tree's bootstrap sample. It is different from the mean squared error on the test set. We prefer to use the test set error for model evaluation because it allows us to compare the performance of different models. If FALSE, the out-of-bag error is not calculated. This can save some computation time.
seed
: Random seed for reproducibility. This is the random seed passed to the ranger engine itself. The seed set in the code chunk does not appear to control ranger's randomness here, so we set the seed in the engine call to make the results reproducible.
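To make "out-of-bag" concrete, the simulation below (base R only, with illustrative names) draws bootstrap samples of row indices and measures what fraction of rows each sample leaves out. For n rows, the expected fraction is (1 - 1/n)^n, which approaches 1/e, or roughly 37%, as n grows.

```r
set.seed(1)
n <- 1000  # number of rows in a hypothetical training set

# For each simulated bootstrap sample, record the share of rows never drawn
oob_fraction <- replicate(200, {
  in_bag <- sample(n, replace = TRUE)
  length(setdiff(seq_len(n), in_bag)) / n
})

simulated <- mean(oob_fraction)
theoretical <- (1 - 1 / n)^n

print(c(simulated = simulated, theoretical = theoretical))
```

Because each tree sees only about two thirds of the distinct rows, the remaining rows can serve as a built-in validation set for that tree, which is what the out-of-bag error exploits.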
Metrics to evaluate decision trees
Classification: accuracy, balanced accuracy, F1 score, precision, recall, auROC, ppv, npv, kappa, etc.
Regression: RMSE, MAE, MAPE, R-squared, etc.
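A few of these metrics are simple enough to compute by hand, which can help when reading their values. The sketch below uses small made-up vectors (not lab data) and base R only. R-squared is computed here as the squared correlation between truth and estimate, which is one common definition.

```r
# Regression: made-up truth and predictions
truth <- c(3, 5, 7)
estimate <- c(2, 5, 9)

rmse <- sqrt(mean((truth - estimate)^2))  # penalizes large errors more
mae <- mean(abs(truth - estimate))        # average absolute error
rsq <- cor(truth, estimate)^2             # squared correlation

# Classification: made-up labels, with "yes" as the positive class
truth_cls <- c("yes", "yes", "no", "no", "no", "no")
pred_cls <- c("yes", "no", "no", "no", "no", "yes")

tp <- sum(truth_cls == "yes" & pred_cls == "yes")
fn <- sum(truth_cls == "yes" & pred_cls == "no")
fp <- sum(truth_cls == "no" & pred_cls == "yes")
tn <- sum(truth_cls == "no" & pred_cls == "no")

accuracy <- (tp + tn) / length(truth_cls)
recall <- tp / (tp + fn)        # sensitivity
precision <- tp / (tp + fp)     # positive predictive value
specificity <- tn / (tn + fp)
balanced_accuracy <- (recall + specificity) / 2

print(c(rmse = rmse, mae = mae, rsq = rsq))
print(c(accuracy = accuracy, recall = recall, precision = precision,
        balanced_accuracy = balanced_accuracy))
```

Note how accuracy and balanced accuracy differ on the same predictions: balanced accuracy weights both classes equally, so it is less flattering when the minority class is predicted poorly.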
Class imbalance
Class imbalance is a common issue in classification problems where the number of observations in one class is significantly lower than the other class. This can lead to biased models that predict the majority class more accurately than the minority class.
Class balance can significantly affect the performance of decision trees and random forests. When the classes are imbalanced, we can use techniques such as oversampling, undersampling, or SMOTE to balance the training data.
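Random oversampling is the simplest of these techniques. The sketch below uses simulated data and illustrative variable names, and is not a replacement for a dedicated resampling step: it duplicates randomly chosen minority-class rows until both classes are the same size. Resampling like this should be applied to training data only, so that held-out data keep the original class proportions.

```r
set.seed(7)

# Simulated training data with a 90/10 class split
d <- data.frame(x = rnorm(100),
                y = factor(c(rep("no", 90), rep("yes", 10))))

counts <- table(d$y)
minority <- names(counts)[which.min(counts)]
n_extra <- max(counts) - min(counts)

# Draw extra minority rows with replacement and append them
minority_rows <- which(d$y == minority)
extra <- d[sample(minority_rows, n_extra, replace = TRUE), ]
d_up <- rbind(d, extra)

print(table(d$y))
print(table(d_up$y))
```

Undersampling is the mirror image (drop majority rows instead of duplicating minority rows), while SMOTE generates synthetic minority observations rather than exact copies.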
Missingness
Missing data in decision trees: decision trees can cope with missing data because each candidate split is evaluated using only the observations that have a value for that variable. Observations with a missing value are not used to choose that split. With rpart, for example, such observations are then sent down the tree using surrogate splits.
Missing data in random forests: a random forest is a collection of decision trees, so when a variable with missing data is used for splitting many times, the missingness can bias the model. A common remedy is to impute missing values with the median of the non-missing values in the training set. Whether this happens automatically depends on the implementation, so check the documentation for your engine.
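Median imputation can be sketched in a few lines of base R. The data frames and the helper `impute_median()` below are made up for illustration. The key point is that the medians are learned from the training set and then reused for the test set.

```r
trn <- data.frame(x1 = c(1, 2, NA, 4, 100),
                  x2 = c(10, NA, 30, 40, 50))
tst <- data.frame(x1 = c(NA, 3),
                  x2 = c(5, NA))

# Learn one median per column from the training data only
medians <- sapply(trn, median, na.rm = TRUE)

# Hypothetical helper: fill missing values with the supplied medians
impute_median <- function(d, medians) {
  for (v in names(medians)) {
    d[[v]][is.na(d[[v]])] <- medians[[v]]
  }
  d
}

trn_imp <- impute_median(trn, medians)
tst_imp <- impute_median(tst, medians)

print(medians)
print(tst_imp)
```

In a recipe, the same idea is available as step_impute_median(), which estimates the medians when the recipe is prepped on the training data and applies them when baking new data.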