rerun_setting <- FALSE
2 Cache
We use caching to avoid repeating time-consuming calculations when they will return the same result each time. To understand how, you should start with an introduction to cache in RMarkdown that provides an overview of the rationale for caching objects. However, be aware that that discussion is anchored in the context of knitr’s built-in caching abilities, which we don’t recommend.
To use caching effectively, you need to understand when and how to invalidate the cache.
2.1 Solutions
2.1.1 cache = TRUE
(not recommended)
You can set cache = TRUE in any specific code chunk to have knitr cache those calculations for later reuse. However, we don’t recommend this because it makes the process, and the instances where the cache is invalidated, more opaque. More importantly, this caching will not be used interactively when you send your code chunks to the console as you work live.
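For reference, this is what the approach looks like. The sketch below assumes an R Markdown document; the chunk label (slow-model) and the lm() call are placeholders for illustration only.

```r
# In an R Markdown chunk header (shown here as comments so this sketch
# can live inside a plain code block):
# ```{r slow-model, cache=TRUE}
slow_fit <- lm(mpg ~ ., data = mtcars)  # re-used from cache on later knits
# ```
```

Remember that this cache is only consulted when the document is knit, not when you run the chunk interactively.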
Nonetheless, this approach is well documented, including more advanced topics like cache paths and lazy loading.
2.1.2 Explicit write_rds()
(not recommended)
You could instead manually save objects that you want to avoid recalculating. This is a legitimate method that gives you full and transparent control over caching. It will also work both interactively in the console and when you knit your document. However, it carries a bit more code overhead: you need to write code that checks if the file exists, loads it if it does, and calculates (and saves) the object if it doesn’t. This is not too hard, but it turns out that a function has already been written to handle this overhead for you. We describe that next.
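The manual pattern just described might be sketched as follows. The file path and the stand-in calculation are placeholders you would replace with your own.

```r
# A sketch of the manual caching approach, assuming readr is installed
dir.create("cache", showWarnings = FALSE)
file_path <- "cache/my_object.rds"

if (file.exists(file_path)) {
  x <- readr::read_rds(file_path)   # load the previously cached result
} else {
  Sys.sleep(5)                      # stand-in for a slow calculation
  x <- 2 + 3
  readr::write_rds(x, file_path)    # cache it for next time
}
```

To invalidate this cache, you delete the rds file (or change file_path) and re-run the chunk.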
2.1.3 xfun::cache_rds()
(our preferred method)
We believe that the xfun::cache_rds()
function provides the sweet spot for the balance of control and transparency vs. code overhead. It also works for both interactive/console and render workflows.
Let’s demonstrate its use.
- First, we recommend setting up a variable called rerun_setting and setting it to FALSE. Put this with your other environment settings at the top of your script so you can find (and change) it easily.
Next, we set up some objects that will be used in later time-consuming calculations. You need to be careful with these objects. If they ever change, you will need to explicitly invalidate your cached object and re-calculate it. More on that below.
y <- 2
z <- 3
Now let’s use y and z in some time-consuming set of calculations.
- The first argument in cache_rds() is the code to execute for the time-consuming calculation. This code is provided to the function inside of curly brackets, {}.
- Results from cache_rds() are assigned to your object (e.g., x) as if they came straight from the coded calculations (e.g., instead of x <- y + z, we now have x <- cache_rds({y + z})).
- We recommend explicitly setting the values of dir and file for the cached object. This way, you control where it is saved and are assured it will be the same location regardless of whether you run this code chunk in the console or knit it. Initial testing suggested the filename and location will differ between interactive/console and rendered workflows if you use the defaults. The / at the end of the directory name is required to designate it as a folder. The filename will have the string assigned to file as its prefix but will also have an additional hash and a .rds extension appended to it.
- We recommend explicitly including rerun = rerun_setting as a third parameter. This provides an easy way to invalidate the cached object (and a memory aid to consider invalidation when needed). To invalidate just this code chunk, set it to TRUE and run the chunk again if any of your globals (e.g., y, z) have changed (and then set it back to FALSE after!). We also recommend setting rerun_setting <- TRUE when you are done with the script and ready to render a final version. This will fix any previously undetected cache issues in your final output.
cache_rds() has one additional parameter worth mentioning, hash. You can pass a list of global objects to hash (e.g., hash = list(y, z)) that the function will monitor for changes. If any of these globals are re-calculated, it will invalidate your cached object and re-calculate it. This could be very useful for catching cache invalidation issues. However, our testing suggests that it may invalidate the cache in some instances when it shouldn’t. We haven’t been able to fully document this issue yet. For now, we recommend not using hash and instead manually invalidating as needed using either rerun = TRUE or rerun_setting <- TRUE when you are done with your code.
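If you do want to experiment with hash despite the caveat above, a call might look like the following sketch (the file name cache_demo_hash is an arbitrary choice for this example):

```r
# Hedged sketch: cache_rds() with hash monitoring the globals y and z
x <- xfun::cache_rds({
  Sys.sleep(5)  # pretend that computations take a while
  y + z
},
hash = list(y, z),  # cache is invalidated if y or z change
dir = "cache/",
file = "cache_demo_hash")
```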
x <- xfun::cache_rds({
  Sys.sleep(5) # pretend that computations take a while
  y + z
},
rerun = rerun_setting,
dir = "cache/",
file = "cache_demo")
Now we can use x without recalculating it each time we execute the previous chunk, in either the console or when knit. Yay!
x
[1] 5
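If you want to convince yourself the cache is being used, you can wrap the same call in system.time(); this sketch assumes the chunk above has already been run once so the cached rds file exists.

```r
# The elapsed time should be near zero on a cached run, because the
# result is read from cache/ instead of recomputed (no 5-second sleep)
system.time(
  x <- xfun::cache_rds({
    Sys.sleep(5)
    y + z
  },
  rerun = rerun_setting,
  dir = "cache/",
  file = "cache_demo")
)
```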
You can (and should) read the full documentation on xfun::cache_rds() prior to using it in your own code.
2.1.4 Final notes
2.1.4.1 Cache and Github
We may not want our cached objects added to our repos. This could make the repos too large and/or add sensitive data to them in some instances. It is easy to avoid this, though. Assuming you name the folder that holds the cached files cache/
as recommended above, just add the following line to your .gitignore
file:
*cache/
2.1.4.2 Saving model objects
We have also learned that caching that involves saving an rds file (all of these methods) may encounter problems if you try to cache a keras model object (e.g., via mlp()
in tidymodels). To be clear, there is no problem saving resampling statistics from fit_resamples()
or tune_grid()
. The problem is specific to the actual model object returned from fit()
. This issue with keras (and perhaps some other types of) models is documented, and the bundle
package is designed to solve it. If you plan to cache (or even just directly save) these model objects, read those docs carefully. We will eventually work out a piped solution to either manually save or use cache_rds()
with these objects if needed. This is not a high priority for us right now because we don’t use keras much yet in our lab.
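At its simplest, the bundle workflow might look like the sketch below; model_fit is a placeholder for a fitted keras model object returned from fit(), and the file path is arbitrary.

```r
# Hedged sketch of saving/restoring a keras model with the bundle package
library(bundle)

bundled <- bundle(model_fit)                      # make the model serializable
readr::write_rds(bundled, "cache/model_fit.rds")  # now safe to save as rds

# Later, possibly in a fresh R session:
bundled <- readr::read_rds("cache/model_fit.rds")
model_fit <- unbundle(bundled)                    # restore a usable model
```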