4 File and Path Management

4.1 Use of RStudio Projects

The use of RStudio Projects is critical to good management of your paths and files. When you work within a project, your working directory is set within that project (based on where the project file is saved). This working directory can then be combined with relative paths for reading and writing data and other files. It also means that if you share the folders that contain your project (e.g., scripts, data) with a colleague, the paths will continue to work for that colleague as well, regardless of where they put the folders on their computer.

Wickham et al. describe the rationale and benefits of using projects. Please read this! They also clearly describe the steps to set up a new project, so I won’t repeat them here.

For our course, we strongly recommend that you set up a project called “iaml”. Inside that root project folder, you can establish a folder for “homework”, and inside that folder you can have sub-folders for each unit (e.g., “unit_2”, “unit_3”). In addition to the homework folder, you might have folders for exams (e.g., “midterm”) and other material that you save (e.g., “pdfs”).
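The folder layout described above can be created by hand in your file browser or with a few lines of base R. The following is a minimal sketch that assumes it is run once from the root of the “iaml” project (i.e., with the project open in RStudio). The folder names are the examples from the text, not a required structure.

```r
# Example folders from the text, written relative to the project root
folders <- c(
  file.path("homework", "unit_2"),
  file.path("homework", "unit_3"),
  "midterm",
  "pdfs"
)

# recursive = TRUE also creates the parent "homework" folder;
# showWarnings = FALSE keeps the loop quiet if a folder already exists
for (folder in folders) {
  dir.create(folder, recursive = TRUE, showWarnings = FALSE)
}
```

Because every path is relative, the same lines work no matter where the project folder lives on your computer.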

4.2 Relative Paths

You should also get in the habit of setting relative paths (relative to your project root) near the start of your script so that you can call those paths easily throughout. As an added bonus, if you move those folders within your project, you only need to change one line of code. For example, if your raw data and processed data live in separate folders, you might set two paths:

path_raw <- "data/raw"

path_processed <- "data/processed"

You can use these path objects with the base R function file.path().

For example, if you want to load a csv file in the folder that you indicated above by path_raw, you could use this line of code:

d <- read_csv(file.path(path_raw, "raw_data.csv"))

Alternatively, you could supply the relative path directly, here with here() from the here package (though this is not preferred because it can be cumbersome if you move the folder later):

d <- read_csv(here("data/raw", "raw_data.csv"))
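To make the benefit of file.path() concrete, this short sketch shows the strings it builds from the path objects defined above. The file name processed_data.csv is hypothetical and used only for illustration.

```r
path_raw <- "data/raw"
path_processed <- "data/processed"

# file.path() joins its arguments with "/" on every platform
raw_file <- file.path(path_raw, "raw_data.csv")
raw_file
# "data/raw/raw_data.csv"

# "processed_data.csv" is a hypothetical file name used for illustration
processed_file <- file.path(path_processed, "processed_data.csv")

# Both are relative paths, so R resolves them against the working directory
# (the project root when you work inside an RStudio Project)
file.exists(raw_file)
```

If the raw data folder later moves, only the path_raw line needs to change. Every call that uses file.path(path_raw, ...) picks up the new location.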

4.3 Reading csv files

We typically save our data as csv files (with minor exceptions). There are many benefits to this format:

easy to share (colleagues don’t need R to access)
easy to view outside of R (sometimes, we just want to see the data directly and csv can be viewed in any text editor and/or most spreadsheet apps)

One downside is that csv files don’t store information about variable/column class. We need to establish the appropriate class for each column when we read the data.
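As a small illustration of this downside, the sketch below writes a factor to a temporary csv file and reads it back. It uses base R's write.csv() and read.csv() only to stay dependency-free. The toy data frame and its column name are made up for the example.

```r
# Toy data with one factor column (hypothetical, for illustration only)
d <- data.frame(
  quality = factor(c("low", "high", "low"), levels = c("low", "high"))
)

# Round trip through a temporary csv file
tmp <- tempfile(fileext = ".csv")
write.csv(d, tmp, row.names = FALSE)
d_back <- read.csv(tmp, stringsAsFactors = FALSE)

class(d$quality)       # "factor"
class(d_back$quality)  # "character": the class and level order were not stored
```

read_csv() likewise returns such a column as character unless you tell it otherwise, which is why the class has to be set again when the file is read.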

4.3.1 Using `col_types`

If possible, it is best to set the class for each column/variable explicitly using the col_types argument of readr::read_csv(). This forces you to examine and consider each column to decide its class (e.g., is a column with numbers best set as numeric or ordered factor?) and its levels if its class is nominal. Of course, this is part of cleaning EDA, so you should have done this when you first started working with the data.

Re-classing is typically needed to convert raw character columns to factor (ordered or unordered) and sometimes to convert raw numeric columns to factor (likely ordered, e.g., a Likert scale).

Here is an example using the cars dataset:

library(tidyverse)  # provides read_csv(), col_*(), %>% and glimpse()

path_data <- "data"
df <- read_csv(file.path(path_data, "auto_trn.csv"),
               col_types = list(mpg = col_factor(levels = c("low", "high")),
                                # here we handle cylinders as an ordered factor
                                cylinders = col_factor(levels =
                                                         as.character(c(3,4,5,6,8)),
                                                       ordered = TRUE),
                                displacement = col_double(),
                                horsepower = col_double(),
                                weight = col_double(),
                                acceleration = col_double(),
                                year = col_double(),
                                origin = col_factor(levels =
                                                      c("american",
                                                        "japanese",
                                                        "european")))) %>%
  glimpse()

Rows: 294
Columns: 8
$ mpg          <fct> high, high, high, high, high, high, high, high, high, hig…
$ cylinders    <ord> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, …
$ displacement <dbl> 113.0, 97.0, 97.0, 110.0, 107.0, 104.0, 121.0, 97.0, 140.…
$ horsepower   <dbl> 95, 88, 46, 87, 90, 95, 113, 88, 90, 95, 86, 90, 70, 76, …
$ weight       <dbl> 2372, 2130, 1835, 2672, 2430, 2375, 2234, 2130, 2264, 222…
$ acceleration <dbl> 15.0, 14.5, 20.5, 17.5, 14.5, 17.5, 12.5, 14.5, 15.5, 14.…
$ year         <dbl> 70, 70, 70, 70, 70, 70, 70, 71, 71, 71, 71, 71, 71, 71, 7…
$ origin       <fct> japanese, japanese, european, european, european, europea…
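One reason to declare factor levels at read time is that it gives you a chance to catch values you did not expect. The following is a minimal sketch under stated assumptions: the temporary file and the value "martian" are invented for the example, and it relies on readr's documented behavior of treating values outside the declared levels as parsing problems.

```r
library(readr)

# Toy file standing in for a real data set; "martian" is deliberately
# not one of the declared levels
tmp <- tempfile(fileext = ".csv")
writeLines(c("origin", "american", "japanese", "martian"), tmp)

d <- read_csv(tmp,
              col_types = cols(origin = col_factor(levels = c("american",
                                                              "japanese",
                                                              "european"))))

levels(d$origin)      # the levels you declared, in the order you declared them
sum(is.na(d$origin))  # values outside the declared levels should be read as NA
problems(d)           # readr's record of values it could not parse
```

Checking problems() right after reading is a cheap way to confirm that the levels you wrote down during cleaning EDA still match the file.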

4.3.2 Using a separate `mutate()`

In some instances (e.g., a data file with a very large number of variables, or very consistently organized data where the character data are well-behaved), you may want to read the data in first and then use mutate() to change classes as needed.

In these instances, we prefer to set the col_types argument to cols() to suppress the verbose message about column classes.

Here is an example using the ames dataset with all predictors. To start, we only re-class all character columns to unordered factor and one numeric column to an ordered factor. As we work with the data (during cleaning EDA), we may decide that there are other columns that need to be re-classed. If so, we could add additional lines to the mutate() call.

df <- read_csv(file.path(path_data, "ames_full_cln.csv"),
               col_types = cols()) %>% 
  # convert all character to unordered factors
  mutate(across(where(is.character), as_factor),
         overall_qual = ordered(overall_qual, levels = as.character(1:10))) %>% 
  glimpse()

Rows: 1,955
Columns: 81
$ pid             <fct> x0526301100, x0526350040, x0526351010, x0527105010, x0…
$ ms_sub_class    <fct> x020, x020, x020, x060, x120, x120, x120, x060, x060, …
$ ms_zoning       <fct> rl, rh, rl, rl, rl, rl, rl, rl, rl, rl, rl, rl, rl, rl…
$ lot_frontage    <dbl> 141, 80, 81, 74, 41, 43, 39, 60, 75, 63, 85, NA, 47, 1…
$ lot_area        <dbl> 31770, 11622, 14267, 13830, 4920, 5005, 5389, 7500, 10…
$ street          <fct> pave, pave, pave, pave, pave, pave, pave, pave, pave, …
$ alley           <fct> none, none, none, none, none, none, none, none, none, …
$ lot_shape       <fct> ir1, reg, ir1, ir1, reg, ir1, ir1, reg, ir1, ir1, reg,…
$ land_contour    <fct> lvl, lvl, lvl, lvl, lvl, hls, lvl, lvl, lvl, lvl, lvl,…
$ utilities       <fct> all_pub, all_pub, all_pub, all_pub, all_pub, all_pub, …
$ lot_config      <fct> corner, inside, corner, inside, inside, inside, inside…
$ land_slope      <fct> gtl, gtl, gtl, gtl, gtl, gtl, gtl, gtl, gtl, gtl, gtl,…
$ neighborhood    <fct> n_ames, n_ames, n_ames, gilbert, stone_br, stone_br, s…
$ condition_1     <fct> norm, feedr, norm, norm, norm, norm, norm, norm, norm,…
$ condition_2     <fct> norm, norm, norm, norm, norm, norm, norm, norm, norm, …
$ bldg_type       <fct> one_fam, one_fam, one_fam, one_fam, twhs_ext, twhs_ext…
$ house_style     <fct> x1story, x1story, x1story, x2story, x1story, x1story, …
$ overall_qual    <ord> 6, 5, 6, 5, 8, 8, 8, 7, 6, 6, 7, 8, 8, 8, 9, 4, 6, 6, …
$ overall_cond    <dbl> 5, 6, 6, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 7, 2, 5, 6, 6, …
$ year_built      <dbl> 1960, 1961, 1958, 1997, 2001, 1992, 1995, 1999, 1993, …
$ year_remod_add  <dbl> 1960, 1961, 1958, 1998, 2001, 1992, 1996, 1999, 1994, …
$ roof_style      <fct> hip, gable, hip, gable, gable, gable, gable, gable, ga…
$ roof_matl       <fct> comp_shg, comp_shg, comp_shg, comp_shg, comp_shg, comp…
$ exterior_1st    <fct> brk_face, vinyl_sd, wd_sdng, vinyl_sd, cemnt_bd, hd_bo…
$ exterior_2nd    <fct> plywood, vinyl_sd, wd_sdng, vinyl_sd, cment_bd, hd_boa…
$ mas_vnr_type    <fct> stone, none, brk_face, none, none, none, none, none, n…
$ mas_vnr_area    <dbl> 112, 0, 108, 0, 0, 0, 0, 0, 0, 0, 0, 0, 603, 0, 350, 0…
$ exter_qual      <fct> ta, ta, ta, ta, gd, gd, gd, ta, ta, ta, ta, gd, ex, gd…
$ exter_cond      <fct> ta, ta, ta, ta, ta, ta, ta, ta, ta, ta, ta, ta, ta, ta…
$ foundation      <fct> c_block, c_block, c_block, p_conc, p_conc, p_conc, p_c…
$ bsmt_qual       <fct> ta, ta, ta, gd, gd, gd, gd, ta, gd, gd, gd, gd, gd, gd…
$ bsmt_cond       <fct> gd, ta, ta, ta, ta, ta, ta, ta, ta, ta, ta, ta, ta, ta…
$ bsmt_exposure   <fct> gd, no, no, no, mn, no, no, no, no, no, gd, av, gd, av…
$ bsmt_fin_type_1 <fct> blq, rec, alq, glq, glq, alq, glq, unf, unf, unf, glq,…
$ bsmt_fin_sf_1   <dbl> 639, 468, 923, 791, 616, 263, 1180, 0, 0, 0, 637, 368,…
$ bsmt_fin_type_2 <fct> unf, lw_q, unf, unf, unf, unf, unf, unf, unf, unf, unf…
$ bsmt_fin_sf_2   <dbl> 0, 144, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1120, 0, 0, 0, 0, 1…
$ bsmt_unf_sf     <dbl> 441, 270, 406, 137, 722, 1017, 415, 994, 763, 789, 663…
$ total_bsmt_sf   <dbl> 1080, 882, 1329, 928, 1338, 1280, 1595, 994, 763, 789,…
$ heating         <fct> gas_a, gas_a, gas_a, gas_a, gas_a, gas_a, gas_a, gas_a…
$ heating_qc      <fct> fa, ta, ta, gd, ex, ex, ex, gd, gd, gd, gd, ta, ex, gd…
$ central_air     <fct> y, y, y, y, y, y, y, y, y, y, y, y, y, y, y, y, y, y, …
$ electrical      <fct> s_brkr, s_brkr, s_brkr, s_brkr, s_brkr, s_brkr, s_brkr…
$ x1st_flr_sf     <dbl> 1656, 896, 1329, 928, 1338, 1280, 1616, 1028, 763, 789…
$ x2nd_flr_sf     <dbl> 0, 0, 0, 701, 0, 0, 0, 776, 892, 676, 0, 0, 1589, 672,…
$ low_qual_fin_sf <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ gr_liv_area     <dbl> 1656, 896, 1329, 1629, 1338, 1280, 1616, 1804, 1655, 1…
$ bsmt_full_bath  <dbl> 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, …
$ bsmt_half_bath  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ full_bath       <dbl> 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 1, 1, 3, 2, 1, 1, 2, 2, …
$ half_bath       <dbl> 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, …
$ bedroom_abv_gr  <dbl> 3, 2, 3, 3, 2, 2, 2, 3, 3, 3, 2, 1, 4, 4, 1, 2, 3, 3, …
$ kitchen_abv_gr  <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ kitchen_qual    <fct> ta, ta, gd, ta, gd, gd, gd, gd, ta, ta, gd, gd, ex, ta…
$ tot_rms_abv_grd <dbl> 7, 5, 6, 6, 6, 5, 5, 7, 7, 7, 5, 4, 12, 8, 8, 4, 7, 7,…
$ functional      <fct> typ, typ, typ, typ, typ, typ, typ, typ, typ, typ, typ,…
$ fireplaces      <dbl> 2, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 2, 1, …
$ fireplace_qu    <fct> gd, none, none, ta, none, none, ta, ta, ta, gd, po, no…
$ garage_type     <fct> attchd, attchd, attchd, attchd, attchd, attchd, attchd…
$ garage_yr_blt   <dbl> 1960, 1961, 1958, 1997, 2001, 1992, 1995, 1999, 1993, …
$ garage_finish   <fct> fin, unf, unf, fin, fin, r_fn, r_fn, fin, fin, fin, un…
$ garage_cars     <dbl> 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 3, 2, 2, 2, …
$ garage_area     <dbl> 528, 730, 312, 482, 582, 506, 608, 442, 440, 393, 506,…
$ garage_qual     <fct> ta, ta, ta, ta, ta, ta, ta, ta, ta, ta, ta, ta, ta, ta…
$ garage_cond     <fct> ta, ta, ta, ta, ta, ta, ta, ta, ta, ta, ta, ta, ta, ta…
$ paved_drive     <fct> p, y, y, y, y, y, y, y, y, y, y, y, y, y, y, y, y, y, …
$ wood_deck_sf    <dbl> 210, 140, 393, 212, 0, 0, 237, 140, 157, 0, 192, 0, 50…
$ open_porch_sf   <dbl> 62, 0, 36, 34, 0, 82, 152, 60, 84, 75, 0, 54, 36, 12, …
$ enclosed_porch  <dbl> 0, 0, 0, 0, 170, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ x3ssn_porch     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ screen_porch    <dbl> 0, 120, 0, 0, 0, 144, 0, 0, 0, 0, 0, 140, 210, 0, 0, 0…
$ pool_area       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ pool_qc         <fct> none, none, none, none, none, none, none, none, none, …
$ fence           <fct> none, mn_prv, none, mn_prv, none, none, none, none, no…
$ misc_feature    <fct> none, none, gar2, none, none, none, none, none, none, …
$ misc_val        <dbl> 0, 0, 12500, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ mo_sold         <dbl> 5, 6, 6, 3, 4, 1, 3, 6, 4, 5, 2, 6, 6, 6, 6, 6, 2, 1, …
$ yr_sold         <dbl> 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, …
$ sale_type       <fct> wd, wd, wd, wd, wd, wd, wd, wd, wd, wd, wd, wd, wd, wd…
$ sale_condition  <fct> normal, normal, normal, normal, normal, normal, normal…
$ sale_price      <dbl> 215000, 105000, 172000, 189900, 213500, 191500, 236500…
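Adding further re-classing lines to the mutate() might look like the sketch below. It uses a small toy tibble rather than the real file. The choice of exter_qual and overall_cond, and the assumed quality ordering (po < fa < ta < gd < ex), are illustrative assumptions that you would confirm against the data dictionary.

```r
library(dplyr)

# Toy stand-in for a few rows of the data read above
d <- tibble(exter_qual = c("ta", "gd", "ex"),
            overall_cond = c(5, 6, 2))

d <- d %>%
  mutate(
    # assumed quality order: poor < fair < typical < good < excellent
    exter_qual = factor(exter_qual,
                        levels = c("po", "fa", "ta", "gd", "ex"),
                        ordered = TRUE),
    # numeric rating re-classed as an ordered factor, as for overall_qual
    overall_cond = ordered(overall_cond, levels = as.character(1:10))
  ) %>%
  glimpse()
```

Listing the levels explicitly keeps their order under your control instead of leaving it to the order in which values happen to appear in the file.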

4.3.3 Using data dictionaries (a.k.a. codebooks)

4.4 Sourcing from GitHub

Scripts in public repositories on GitHub can be sourced directly from the remote repository using source_url() from the `devtools` package. To do this, follow these steps:

Find the url to the specific file/script you would like to source. This can be done by simply clicking on the file through GitHub in your browser. For example, the url to fun_modeling.R in my lab_support repo is:

https://github.com/jjcurtin/lab_support/blob/main/fun_modeling.R

Add ?raw=true to the end of that url. For example:

https://github.com/jjcurtin/lab_support/blob/main/fun_modeling.R?raw=true

Pass this url as a string into `devtools::source_url()` in your R script. For example:

devtools::source_url("https://github.com/jjcurtin/lab_support/blob/main/fun_modeling.R?raw=true")

It’s that easy. Using this method will allow you to continue to use the most up-to-date version of that script even as the repo owner improves it over time. It also doesn’t require you to worry about where a local clone of that repo might live on your computer or the computers of anyone with whom you share your code.
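If you source several scripts this way, the steps above can be wrapped in a small helper. This is only a sketch: github_raw_url() is a hypothetical function name introduced here, not part of devtools, and the final call is left commented out because it needs the devtools package and network access.

```r
# Hypothetical helper: append ?raw=true to a GitHub file url
github_raw_url <- function(blob_url) {
  paste0(blob_url, "?raw=true")
}

blob_url <- "https://github.com/jjcurtin/lab_support/blob/main/fun_modeling.R"
raw_url <- github_raw_url(blob_url)
raw_url
# "https://github.com/jjcurtin/lab_support/blob/main/fun_modeling.R?raw=true"

# Uncomment to source the script (requires devtools and network access)
# devtools::source_url(raw_url)
```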

4.5 Additional Resources

Blog with links on the use of projects and here() package
Good advice for folder management in projects.
More good advice on projects and file management