Caching reference objects #2176

rhijmans · 2022-09-27T17:46:14Z

Is there a way, or could support be added, to tell knitr how to serialize objects of a particular class before caching, and how to restore them after caching? See this and this questions on stackoverflow for more context

The "terra" package uses S4 objects with one slot that has a reference to a C++ object (via a Rcpp module). These S4 objects can be serialized and unserialized with very little effort. For example, one can do

library(terra)
f <- system.file("ex/elev.tif", package="terra")
r <- terra::rast(f)  # S4 object that has a reference class
x <- serialize(r, NULL) 
# recreate the object 
s <- terra::rast( unserialize(x) )  
s
#class       : SpatRaster 
#dimensions  : 90, 95, 1  (nrow, ncol, nlyr)
#resolution  : 0.008333333, 0.008333333  (x, y)
#extent      : 5.741667, 6.533333, 49.44167, 50.19167  (xmin, xmax, ymin, ymax)
#coord. ref. : lon/lat WGS 84 (EPSG:4326) 
#source      : memory 
#name        : elevation 
#min value   :       141 
#max value   :       547

But can one use that mechanism with knitr caching? I suppose that it could be possible, in principle, to specify that objects of class "SpatRaster" should be passed to serialize before caching, and that terra::rast should be used when restoring the object. Whether this is easy to do in practice, I do not know.

By filing an issue to this repo, I promise that

I have fully read the issue guide at https://yihui.org/issue/.
I have provided the necessary information about my issue.
- If I'm asking a question, I have already asked it on Stack Overflow or RStudio Community, waited for at least 24 hours, and included a link to my question there.
- If I'm filing a bug report, I have included a minimal, self-contained, and reproducible example, and have also included xfun::session_info('knitr'). I have upgraded all my packages to their latest versions (e.g., R, RStudio, and R packages), and also tried the development version: remotes::install_github('yihui/knitr').
- If I have posted the same issue elsewhere, I have also mentioned it in this issue.
I have learned the Github Markdown syntax, and formatted my issue correctly.

I understand that my issue may be closed if I don't fulfill my promises.

The text was updated successfully, but these errors were encountered:

yihui · 2022-09-27T21:34:37Z

Okay, I have a better idea on how to implement this now. Basically, as you hinted in the SO posts, neither save() + load() or tools:::makeLazyLoadDB() + lazyLoad() works for these raster objects, but saveRDS() + readRDS() works. That means I can cache to .rds files instead of .RData.

However, again, as you mentioned, users would need to call terra::rast() on the value from readRDS(). Although I could do that automatically in knitr, I would prefer not to deal with this type of special cases unless I must. I wonder if it's possible for the terra package to automatically deal with packed objects. I'm asking since I just realized who you are 20 minutes ago :)

rhijmans · 2022-09-27T23:25:03Z

I agree that you should not have to deal with special cases. Here are some ideas for "terra" and some new things that I have implemented. Feedback appreciated.

It would be possible, I think, to automagically restore the "terra" objects if (a) instead of creating separate packed objects like PackedSpatRaster I would put the data in a slot in the original object (SpatRaster) and (b) any time a SpatRaster is used, a check would be done if it needs to be unpacked. So that means that I would have to add that check to all functions that take a SpatRaster (there are several 100s). A bit of a nightmare but doable. There could also be a huge performance penalty because the unpacking would need to be done every time the object is used (unless it is possible to safely overwrite the object in the global environment). If only there were something like an "onReadRDS" event that could trigger unwrapping...
I have added a method unwrap that could be run after a cached block. It will restore the original object if it is a PackedSpatRaster or PackedSpatVector, and let all other objects pass through. That is much better than using rast because it works in all cases. Whereas if you run rast on a SpatRaster it returns a different object (a template with no cell values). But this would mean that a user would have to remember to call for all terra objects used after a cached block. Not pretty.

library(terra)
f <- system.file("ex/elev.tif", package="terra")
r <- terra::rast(f)  # S4 object that has a reference class
p <- wrap(r)
unwrap(p)

Allow for setting an unwrap method that is called after restoring the cache. Something in the spirit of

cacherestore : terra::unwrap

As all objects could pass through, and a user could provide more specialized functions. That would still be a bit of additional work, but probably a nicer user experience than having to use unwrap on all terra objects in blocks after a cached block.

This may be the best and the worst solution. I have made readRDS and unserialize generic functions and now you get the terra version in you load the package (unless another package does the same and it loaded later). I make them do:

base::readRDS("file.rds")  |> unwrap()
# and
base::unserialzie("file.rds")  |> unwrap()

And now I see this:

x <- serialize(r, NULL) 
y <- unserialize(x)
(y <- unserialize(x))
#class       : SpatRaster 
#dimensions  : 90, 95, 1  (nrow, ncol, nlyr)
#resolution  : 0.008333333, 0.008333333  (x, y)
#extent      : 5.741667, 6.533333, 49.44167, 50.19167  (xmin, xmax, ymin, ymax)
#coord. ref. : lon/lat WGS 84 (EPSG:4326) 
#source      : memory 
#name        : elevation 
#min value   :       141 
#max value   :       547

And

frds <- "test.rds"
saveRDS(r, frds) 
readRDS(frds)
#class       : SpatRaster 
#dimensions  : 90, 95, 1  (nrow, ncol, nlyr)
#resolution  : 0.008333333, 0.008333333  (x, y)
#extent      : 5.741667, 6.533333, 49.44167, 50.19167  (xmin, xmax, ymin, ymax)
#coord. ref. : lon/lat WGS 84 (EPSG:4326) 
#source      : memory 
#name        : elevation 
#min value   :       141 
#max value   :       547

This could work well with knitr caching if knitr can use saveRDS. However, while saveRDS has a specific signature for the object to be saved (say, a SpatRaster), the signature for readRDS is "character" (the filename). So this seems a bit fragile as other packages could overwrite it. While method overwriting is not uncommon, this one would be hard to spot for a user and they could not fix it with terra::readRDS as they do not call it.

Also, you have to load the package for this to work. It does not work like this in a clean session:

f <- system.file("ex/elev.tif", package="terra")
r <- terra::rast(f)
frds <- "test.rds"
saveRDS(r, frds) 
readRDS(frds)

This works

terra::saveRDS(r, frds) 
terra::readRDS(frds)

Perhaps an ideal long-term solution would be for base to have a generic unwrap method that is called in readRDS such that packages could implement it for different types of wrapped objects.

yihui · 2022-09-28T21:41:42Z

Thanks for these ideas! For now, I feel the best way might be that I provide an S3 generic function in knitr to process cached objects. Then other package authors like you can register the methods in their own packages. The default method of this function will be simply the identity function, i.e., function(x) x.

rhijmans · 2022-09-28T23:29:19Z

Would the downside of that solution be that these packages then would need to depend on knitr?

The same would be needed when creating the cache, unless you can use saveRDS. Using saveRDS would be great, I think because that is very general.

yihui · 2022-09-29T01:52:51Z

I'll use saveRDS().

You don't need to depend on knitr. Here is the trick we've been using to register S3 methods dynamically on load: https://github.com/rstudio/htmltools/blob/5fa01e7197143844be141dcc0cca85096059497d/R/tags.R#L22-L56

atusy · 2024-04-26T15:24:11Z

@rhijmans @yihui
I opened #2340 to solve this issue, although I am not sure if I fully understand the discussion.

This PR adds knit_cache_preprocess and knit_cache_postprocess who can modify objects being saved and loaded, respectively.
Does it meet @rhijmans 's request?

registerS3method("knit_cache_preprocess", "terra", function(x) {
  # do some modifications at here
}, envir = asNamespace("knitr"))
registerS3method("knit_cache_postprocess", "terra", function(x) {
  # do some modifications at here  
}, envir = asNamespace("knitr"))

This comment was marked as outdated.

Sign in to view

yihui assigned atusy Apr 2, 2024

gavril0 mentioned this issue Apr 17, 2024

Problem caching instances of torch modules and datasets #2339

Open

This comment was marked as duplicate.

Sign in to view

atusy linked a pull request Apr 26, 2024 that will close this issue

customizable cache (closes #2176) #2340

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Caching reference objects #2176

Caching reference objects #2176

rhijmans commented Sep 27, 2022 •

edited

This comment was marked as outdated.

yihui commented Sep 27, 2022

rhijmans commented Sep 27, 2022 •

edited

yihui commented Sep 28, 2022

rhijmans commented Sep 28, 2022

yihui commented Sep 29, 2022

This comment was marked as duplicate.

atusy commented Apr 26, 2024

Caching reference objects #2176

Caching reference objects #2176

Comments

rhijmans commented Sep 27, 2022 • edited

This comment was marked as outdated.

yihui commented Sep 27, 2022

rhijmans commented Sep 27, 2022 • edited

yihui commented Sep 28, 2022

rhijmans commented Sep 28, 2022

yihui commented Sep 29, 2022

This comment was marked as duplicate.

atusy commented Apr 26, 2024

rhijmans commented Sep 27, 2022 •

edited

rhijmans commented Sep 27, 2022 •

edited