2 Cache

We use caching to avoid repeating time-consuming calculations that return the same result each time. To understand how, start with an introduction to caching in RMarkdown, which provides an overview of the rationale for caching objects. However, be aware that the discussion is anchored in the context of knitr’s built-in caching abilities, which we don’t recommend.

To use caching effectively, you need to understand when and how to invalidate the cache.

2.1 Solutions

2.1.1 `cache = TRUE` (not recommended)

You can set cache = TRUE in any code chunk to have knitr cache those calculations for later reuse. However, we don’t recommend this because it makes it less clear when and why the cache is invalidated. More importantly, this cache is not used during interactive work, when you send your code chunks to the console as you work live.

Nonetheless, this approach is well documented, including more advanced topics like paths and lazy loading.

2.1.2 Explicit `write_rds()` (OK - we use this in our lab at times)

You could instead manually save objects that you want to avoid recalculating. This is a legitimate method that gives you full and transparent control over caching. It also works both interactively in the console and when you knit your document. However, it requires a bit more code. You need to write code that checks if the file exists, loads it if it does, and calculates the object if it doesn’t. This is not too hard, but it turns out that a function has already been written to handle this overhead for you. We describe that next.

Here is an example of this brute force (but more transparent) option.

Make a folder called cache at the root of your project. This is where we will store the file that contains the output of your costly computations.

Then write an if-else statement that checks if that file already exists. If it does, load it. If it does not, do the computations and save the object (so that it’s available the next time you need it).

Now, if you need to update the computations, you must manually delete brute_force_method.rds, and the computations will then be recalculated the next time the code runs.

if (file.exists(here::here("cache/brute_force_method.rds"))) {

  # file with computations exists, so just load it
  results <- readr::read_rds(here::here("cache/brute_force_method.rds"))

} else {

  # pretend these computations take a while by calling Sys.sleep()
  Sys.sleep(10)
  results <- 1

  # write results so we have them the next time we run this code chunk
  results |> readr::write_rds(here::here("cache/brute_force_method.rds"))
}

results

[1] 1
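The setup and manual invalidation steps described above can be sketched in base R. This is a minimal sketch, not part of the original example: it assumes the working directory is the project root (the example above uses here::here() instead), and the cache_file variable is a hypothetical name introduced only for illustration.

```r
# Hypothetical variable holding the cache path; assumes the working
# directory is the project root
cache_file <- file.path("cache", "brute_force_method.rds")

# Create the cache folder if it does not exist yet
dir.create(dirname(cache_file), showWarnings = FALSE, recursive = TRUE)

# Invalidate the cache: remove the file so that the if-else statement
# recomputes (and re-saves) the results the next time it runs
if (file.exists(cache_file)) {
  file.remove(cache_file)
}
```

Deleting the file by hand in your file browser has the same effect. Doing it in code is mostly useful if you want the invalidation step recorded in your script.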

2.1.3 `xfun::cache_rds()` (our preferred method)

We believe that the xfun::cache_rds() function provides the sweet spot for the balance of control and transparency vs. code overhead. It also works for both interactive/console and render workflows.

Let’s demonstrate its use.

First, we recommend creating a global variable called rerun_setting and setting it to FALSE. Put this with your other environment settings at the top of your script so you can find (and change) it easily.

rerun_setting <- FALSE

Next, we set up some objects that will be used in later time-consuming calculations. You need to be careful with these objects. If they ever change, you will need to explicitly invalidate your cached object and re-calculate it. More on that below.

y <- 2
z <- 3

Now let’s use y and z in a time-consuming set of calculations.

The first argument to cache_rds() is the code that executes the time-consuming calculation. This code is provided to the function inside curly brackets, {}.
Results from cache_rds() are assigned to your object (e.g., x) as if they came straight from the coded calculations (e.g., instead of x <- y + z, we now have x <- cache_rds({y + z})).
We recommend explicitly setting the values of dir and file for the cached object. This way, you control where it is saved and are assured it will be the same location regardless of whether you run this code chunk in the console or knit it. Initial testing suggested the filename and location will differ for interactive/console vs. rendered workflows if you use the defaults. The / at the end of the directory name is required to designate it as a folder. The filename will have the string assigned to file as its prefix, with an additional hash and .rds appended to it.
We recommend explicitly including rerun = rerun_setting as a third argument. This gives you an easy way to invalidate the cached object (and a memory aid to consider invalidation when needed). To invalidate just this code chunk, set rerun to TRUE and run the chunk again if any of your globals (e.g., y, z) have changed (and then set it back to FALSE after!). We also recommend setting rerun_setting <- TRUE when you are done with the script and ready to render a final version. This will fix any previously undetected cache issues in your final output.
cache_rds() has one additional parameter worth mentioning, hash. You can pass a list of global objects to hash (e.g., hash = list(y, z)) that the function will monitor for change. If any of these globals change, the function will invalidate your cached object and re-calculate it. This could be very useful for catching cache invalidation issues. However, our testing suggests that it may invalidate the cache in some instances when it shouldn’t. We haven’t been able to fully document this issue yet. For now, we recommend not using hash and instead manually invalidating as needed using either rerun = TRUE or rerun_setting <- TRUE when you are done with your code.

x <- xfun::cache_rds({
  Sys.sleep(5) # pretend that computations take a while
  y + z
},
rerun = rerun_setting,
dir = "cache/",
file = "cache_demo")

Now we can use x without recalculating it each time we execute the previous chunk, whether in the console or when knitting. Yay!

[1] 5
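To see when the cached code actually runs, the following sketch counts executions across three calls. It assumes xfun is installed and that no cache file with this prefix exists yet. The names run_count, cached_sum, and cache_rerun_demo are hypothetical and used only for this illustration.

```r
# Hypothetical counter used only to show when the expensive code really runs
run_count <- 0

# Hypothetical wrapper so the same cached expression can be called repeatedly
cached_sum <- function(rerun) {
  xfun::cache_rds({
    run_count <<- run_count + 1 # side effect: counts real executions
    2 + 3
  },
  rerun = rerun,
  dir = "cache/",
  file = "cache_rerun_demo")
}

first  <- cached_sum(rerun = FALSE) # no cache file yet, so the code runs
second <- cached_sum(rerun = FALSE) # cache file exists, so the code is skipped
third  <- cached_sum(rerun = TRUE)  # cache is discarded and the code runs again

# The cached file name starts with the `file` prefix and ends in .rds
cache_files <- list.files("cache", pattern = "^cache_rerun_demo")
```

All three calls return the same value, but run_count only increases when the code inside the curly brackets is actually evaluated. Inspecting cache_files shows the prefix, hash, and .rds extension described above.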

You can (and should) read the full documentation on xfun::cache_rds() prior to using it in your own code.
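If you want to experiment with the hash parameter despite the caveat above, a minimal sketch might look like the following. It assumes xfun is installed. The file prefix cache_hash_demo and the object x_new are hypothetical names for this illustration, and this is not the approach we recommend for day-to-day use.

```r
y <- 2
z <- 3

# Cache the result and ask cache_rds() to track the values of y and z
x <- xfun::cache_rds({
  y + z
},
rerun = FALSE,
dir = "cache/",
file = "cache_hash_demo",
hash = list(y, z))

# Change one of the tracked globals; the values passed to hash now differ,
# so the previously cached object is not reused and the code runs again
y <- 10

x_new <- xfun::cache_rds({
  y + z
},
rerun = FALSE,
dir = "cache/",
file = "cache_hash_demo",
hash = list(y, z))
```

Without hash = list(y, z), the second call would find the existing cache file and return the stale value, which is exactly the situation that rerun = TRUE is meant to fix manually.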

2.1.4 Final notes

2.1.4.1 Cache and GitHub

We may not want our cached objects added to our repos. They could make the repos too large and/or add sensitive data to them in some instances. This is easy to avoid, though. Assuming you save the cached files to a folder called cache/ as recommended above, just add the following line to your .gitignore file:

*cache/

2.1.4.2 Saving model objects

We have also learned that caching that involves saving an rds file (all of these methods) may encounter problems if you try to cache a keras model object (e.g., via mlp() in tidymodels). To be clear, there is no problem saving resampling statistics from fit_resamples() or tune_grid(). The problem is specific to the actual model object returned from fit(). This issue with keras models (and perhaps some other model types) is documented, and the bundle package is designed to solve it. If you plan to cache (or even just directly save) these model objects, read these docs carefully. We will eventually work out a piped solution that either manually saves these objects or uses cache_rds() with them if needed. This is not a high priority for us right now because we don’t use keras much yet in our lab.