
memory leak #2866

Closed
pdbailey0 opened this issue May 11, 2018 · 36 comments

@pdbailey0

pdbailey0 commented May 11, 2018

I found what appears to be a memory leak in data.table, but only when using the current 3.6.0 nightly build of R.

# Minimal reproducible example

require(data.table)
n <- 2e6
df <- data.frame(a=rnorm(n),
                 b=factor(rbinom(n,5,prob=0.5),1:5,letters[1:5]),
                 c=factor(rbinom(n,5,prob=0.5),1:5,letters[1:5]))
dt <- setDT(df)
print(pryr::mem_used())
fff <- function(aref) {
  ff <- lapply(1:5, function(i) {
    dt2 <- dt[,list(sumA=sum(get(aref))),by=b][,c:=letters[i]]
    dt2
  })
  return(rbindlist(ff))
}
for(i in 1:10) {
  f <- fff("a")
  rm("f")
  gc()
  print(pryr::mem_used())
}

I would expect the printed memory usage to stay flat (since f is removed and gc() is called), but it goes up. Notice that the growth always takes a few iterations to appear; adding [,c:=letters[i]] triggers it sooner but is not necessary (without it, you may have to increase the loop count from 10 to 200 or so).

example output

66.6 MB
66.6 MB
66.6 MB
66.6 MB
87.3 MB
190 MB

# Output of sessionInfo()

R Under development (unstable) (2018-05-10 r74708)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.11.3

loaded via a namespace (and not attached):
[1] compiler_3.6.0   pryr_0.1.4       magrittr_1.5     tools_3.6.0     
[5] Rcpp_0.12.16     stringi_1.1.7    codetools_0.2-15 stringr_1.3.0  

Also posted on Stack Overflow.

This is a problem with both the CRAN version and the GitHub version of data.table.

@mattdowle mattdowle added this to the 1.11.4 milestone May 11, 2018
@pdbailey0
Author

pdbailey0 commented May 11, 2018

I'll add that the call to get() appears to be necessary (sum(get(aref))), but it need not be inside a function. So this works:

# NOTE: sum(get(aref)) now sum(a)
require(data.table)
n <- 2e6
df <- data.frame(a=rnorm(n),
                 b=factor(rbinom(n,5,prob=0.5),1:5,letters[1:5]),
                 c=factor(rbinom(n,5,prob=0.5),1:5,letters[1:5]))
dt <- setDT(df)
print(pryr::mem_used())
aref <- "a"
for(i in 1:10) {
  ff <- lapply(1:5, function(i) {
    dt2 <- dt[,list(sumA=sum(a)),by=b][,c:=letters[i]]
    dt2
  })
  f <- rbindlist(ff)
  rm("f")
  gc()
  print(pryr::mem_used())
}
# no memory leak
#66.6 MB (repeated)

but this causes the memory leak

# NOTE: no fff function
require(data.table)
n <- 2e6
df <- data.frame(a=rnorm(n),
                 b=factor(rbinom(n,5,prob=0.5),1:5,letters[1:5]),
                 c=factor(rbinom(n,5,prob=0.5),1:5,letters[1:5]))
dt <- setDT(df)
print(pryr::mem_used())
aref <- "a"
for(i in 1:10) {
  ff <- lapply(1:5, function(i) {
    dt2 <- dt[,list(sumA=sum(get(aref))),by=b][,c:=letters[i]]
    dt2
  })
  f <- rbindlist(ff)
  rm("f")
  gc()
  print(pryr::mem_used())
}
# 66.6 MB
# 170 MB
# 273 MB
# 375 MB

Interestingly, running the first code chunk after the second shows the leak there too; the first chunk just can't get it started on its own.

@nguyentr17

Using .SDcols also causes a memory leak in R 3.6.0.

sessionInfo

require(data.table)
sessionInfo()
# R Under development (unstable) (2018-05-10 r74708)
# Platform: x86_64-w64-mingw32/x64 (64-bit)
# Running under: Windows 7 x64 (build 7601) Service Pack 1
# 
# Matrix products: default
# 
# locale:
#   [1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
# [3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
# [5] LC_TIME=English_United States.1252    
# 
# attached base packages:
#   [1] stats     graphics  grDevices utils     datasets  methods   base     
# 
# other attached packages:
#   [1] data.table_1.11.2
# 
# loaded via a namespace (and not attached):
#   [1] compiler_3.6.0 tools_3.6.0 

Example:

The memory leak only occurs when .SDcols has length greater than 1.

n <- 2e6
df <- data.frame(a=rnorm(n),
                 b=factor(rbinom(n,5,prob=0.5),1:5,letters[1:5]),
                 c=factor(rbinom(n,5,prob=0.5),1:5,letters[1:5]))
setDT(df)

fff2 <- function(colList) {
  ff <- lapply(1:5, function(i) {
    df[,lapply(.SD,sum),by=b,.SDcols = colList]
  })
  return(NULL)
}
gc()
print(pryr::mem_used())
# .SDcols of length 1 does not have memory leak
for(i in 1:10) {
  f <- fff2(c("a"))
  gc()
  print(pryr::mem_used())
}
# 93 MB
# 93.1 MB
# 93.1 MB
# 93.1 MB
# 93.1 MB
# 93.1 MB
# 93.1 MB
# 93.1 MB
# 93.1 MB
# 93.1 MB

# .SDcols of length > 1 has memory leak
for(i in 1:10) {
  f <- fff2(c("a","a"))
  gc()
  print(pryr::mem_used())
}
# 93.1 MB
# 93.1 MB
# 148 MB
# 291 MB
# 433 MB
# 576 MB
# 719 MB
# 862 MB
# 1 GB
# 1.15 GB

@mattdowle
Member

mattdowle commented May 11, 2018

I don't see the leak with these examples on Linux with the latest r-devel (well, as of 2nd May, and we'd been seeing this problem before then). Looks likely to be Windows-only then, consistent with #2767. That will mean debugging on Windows, which I've never done before. I'm surprised a leak like this would be Windows-only, as it relates more to R's internal heap, which is platform-independent as far as I know.

@pdbailey0
Author

We first saw this a few days ago, if that helps.

@mattdowle
Member

mattdowle commented May 11, 2018

@pdbailey0 Yes, that helps. Have you been routinely using the R-devel daily snapshot every day for more than a few weeks, and only saw the problem a few days ago?

@pdbailey0
Author

@nguyentr17 did not see this problem with the 3.5.0 pre-release, and I did with the 3.6.0 pre-release; do we know when they switched over? It occurs to me that I don't know how often she updated her R-devel.

@mattdowle
Member

Ok thanks. I updated R-devel on my laptop to 2018-05-10 (r74708) but still no leak on Linux. I used your tests verbatim and increased the loops to 100 too. We'll just have to bite the bullet and use Windows to trace it...

@HughParsonage
Member

Can reproduce on Windows R-devel.

@pdbailey0
Author

When I tried to submit my package, the Windows check had an error that I believe indicates data.table is currently not available on CRAN's Windows install.

** byte-compile and prepare package for lazy loading
Warning: S3 methods 'all.equal.data.table', 'groupingsets.data.table', 'cube.data.table', 'rollup.data.table', '[.data.table', '[<-.data.table', '$<-.data.table', 'print.data.table', 'as.data.table.data.table', 'as.data.table.data.frame', 'as.data.table.array', 'as.data.table.matrix', 'as.data.table.list', 'as.data.table.integer', 'as.data.table.numeric', 'as.data.table.character', 'as.data.table.logical', 'as.data.table.factor', 'as.data.table.ordered', 'as.data.table.Date', 'as.data.table.ITime', 'as.data.table.table', 'as.data.table.xts', 'as.data.table.default', 'as.data.frame.data.table', 'as.list.data.table', 'as.matrix.data.table', 'split.data.table', 'dim.data.table', 'dimnames.data.table', 'dimnames<-.data.table', 'names<-.data.table', 'duplicated.data.table', 'unique.data.table', 'merge.data.table', 'subset.data.table', 'transform.data.table', 'within.data.table', 'is.na.data.table', 'format.data.table', 'Ops.data.table', 'anyDuplicated.data.table', 'melt.data.table', 'dcast.data.tabl [... truncated]
Error in library.dynam(lib, package, package.lib) : 
  DLL 'datatable' not found: maybe not installed for this architecture?

@HughParsonage
Member

I was able to update a package successfully on 2018-05-13. I suspect you encountered a race condition: try resubmitting.

@pdbailey0
Author

@HughParsonage, you were right: I resubmitted and the DLL was present. Unfortunately, the memory leak led to CRAN running out of memory and my failing a test. I posted on R-package-devel asking for help with the immediate issue of my package submission.

@HughParsonage
Member

HughParsonage commented May 15, 2018

Thanks @pdbailey0: could you provide a link to the package you're attempting to submit? I couldn't see it on your GitHub profile.

To clarify, I don't think this is a problem with your package.

@ltierney

Hint: Running with OMP_NUM_THREADS=1 makes the memory problem go away.

Calling DATAPTR for ALTREP objects temporarily disables the GC. Calling DATAPTR from multiple threads without synchronization creates a race condition that can leave the GC disabled, so memory use grows. It's not a leak: signaling a user interrupt, or anything that triggers a jump to top level, re-enables the GC.
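
In sketch form (illustrative C only, not data.table's actual code), the unsafe and the safe pattern look like this:

#include <R.h>
#include <Rinternals.h>

/* BAD: REAL(x) goes through DATAPTR for ALTREP objects, which can
   allocate and temporarily disables the GC; doing this from several
   threads at once races on that flag and can leave the GC off. */
double unsafe_sum(SEXP x)
{
    double s = 0.0;
    R_xlen_t n = XLENGTH(x);
    #pragma omp parallel for reduction(+:s)
    for (R_xlen_t i = 0; i < n; i++)
        s += REAL(x)[i];              /* R API call on worker threads */
    return s;
}

/* OK: resolve the pointer once, single-threaded (materialising the
   ALTREP there if needed); the threads then touch only raw C memory. */
double safe_sum(SEXP x)
{
    double s = 0.0;
    R_xlen_t n = XLENGTH(x);
    const double *xp = REAL(x);
    #pragma omp parallel for reduction(+:s)
    for (R_xlen_t i = 0; i < n; i++)
        s += xp[i];
    return s;
}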

@jangorecki
Member

@ltierney why is it not consistent across OSes?

@ltierney

@jangorecki it's a race condition. The risk exists on all platforms. When it bites varies from run to run within a platform. How likely it is to bite varies between platforms with performance differences, in particular in thread implementations. On my Ubuntu system, where getDTthreads() returns 4, this variant shows the problem on most runs:

require(data.table)
n <- 20
df <- data.frame(a=rnorm(n),
                 b=factor(rbinom(n,5,prob=0.5),1:5,letters[1:5]),
                 c=factor(rbinom(n,5,prob=0.5),1:5,letters[1:5]))
dt <- setDT(df)
v <- gc()
print(sum(v[, 2]))
fff <- function(aref) {
  ff <- lapply(1:5, function(i) {
    dt2 <- dt[,list(sumA=sum(get(aref))),by=b][,c:=letters[i]]
    dt2
  })
  return(rbindlist(ff))
}
for(i in 1:100) {
  f <- fff("a")
  rm("f")
  v <- gc()
  print(sum(v[, 2]))
}

@mattdowle
Member

Many thanks @ltierney! I see DATAPTR usage within a parallel region in reorder.c. I'll start there ...

@pdbailey0
Author

I can confirm that using setDTthreads(1) fixes this for me and setting it back to 2+ makes the code show the memory inflation. I couldn't make the problem show up in 3.5.0.

@ltierney

ltierney commented May 15, 2018 via email

@mattdowle
Member

That commit just now fixes the reprex for me locally. This one was indeed INTEGER() inside a parallel region on an ALTREP; in subset.c, not reorder.c, as it turned out. For now, it just calls duplicate() on any ALTREP it is passed. I see that duplicate(x) calls duplicate1(x, deep=TRUE), so I'm guessing I can rely on that to expand an ALTREP (since duplicate1 is hidden from package use).

If this works I'll sweep through all parallel regions similarly.
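
A hedged sketch of the shape of that fix (expand_if_altrep is an illustrative name, not the function in the commit):

#include <Rinternals.h>

/* Expand any ALTREP input up front, single-threaded, so the parallel
   code that follows only ever sees ordinary vectors. Assumes, per the
   comment above, that duplicate() reliably expands an ALTREP. */
static SEXP expand_if_altrep(SEXP x, int *nprotect)
{
    if (ALTREP(x)) {
        x = PROTECT(duplicate(x));
        (*nprotect)++;
    }
    return x;
}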

I saw you reduced n to 20 to reproduce this on Ubuntu; n was 2e6 in the Windows reprex as originally reported. Did you make n smaller so that the vectors would be on the R heap? I think I recall you telling me before that Linux R now allocates large vectors on the OS heap but Windows R still puts them on the R heap. If so, would that be a contributing factor in the apparent Windows-only nature of this particular reprex at the 2e6 size?

@ltierney

ltierney commented May 15, 2018 via email

@mattdowle
Member

mattdowle commented May 15, 2018

Huge thanks Luke! We would have been stuck for weeks (perhaps longer) if it wasn't for you.

This one fails for me too, but gracefully, with "DT contains an altrep": I added that error in the last commit. Will address now. [I previously wrote that the error() call was inside a parallel region, which I know to be unsafe, but actually it isn't, so it should be a reliably graceful error, at least in this branch as of now.]

It seems that the test suite is passing AppVeyor R-devel now, for the first time in a month or so!!! Time will tell. I'll tidy up subset.c and merge it to fix this particular reprex so that folks can start testing 1.11.3. I'll continue to sweep the other parallel regions.

@ltierney

ltierney commented May 16, 2018 via email

@mattdowle
Member

mattdowle commented May 16, 2018

The trouble with moving them out is that I'll then often have to allocate a new vector to hold the vector of safe pointers (what INTEGER and REAL return). For data.tables of 10,000 or more columns there's a RAM usage concern, but it's also about the extra code required. It might be faster to do that though, I suppose, since IIUC INTEGER and REAL are slightly heavier now.
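
For concreteness, a minimal sketch of that pointer-vector approach, assuming a VECSXP of all-integer columns (process_int_columns is an illustrative name):

#include <R.h>
#include <Rinternals.h>

/* Gather the "safe pointers" once on the main thread, then hand only
   raw C pointers to the parallel region: ncol * 8 bytes of working
   memory, as discussed below. */
void process_int_columns(SEXP dt)
{
    R_xlen_t ncol = XLENGTH(dt);
    const int **col = (const int **) R_alloc(ncol, sizeof(int *));
    for (R_xlen_t j = 0; j < ncol; j++)
        col[j] = INTEGER(VECTOR_ELT(dt, j));  /* may materialise an ALTREP, safely */
    #pragma omp parallel for
    for (R_xlen_t j = 0; j < ncol; j++) {
        /* per-column work via col[j][...] only;
           no INTEGER()/REAL() calls from worker threads */
    }
}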

As an option for package use, is there a reliable way to expand an ALTREP vector?

@MarkusBonsch
Contributor

I don't think that the memory allocation for a vector of pointers is an issue. A pointer is 64 bits at most, so 1e6 columns would require ~8 MB if I am not mistaken. If the data.table contained all integer columns, that would correspond to the memory needed by two additional rows (assuming 32-bit integers). I think this is definitely acceptable.

What concerns me more is CHARSXPs and VECSXPs. Is it even possible to get rid of SET_STRING_ELT and SET_VECTOR_ELT and replace them with native C code? But then I read that the ELT methods do

Set_elt - Set the value of element i without using dataptr

(https://gmbecker.github.io/jsm2017/jsm2017.html).

So this will probably not be a problem, since they explicitly don't need DATAPTR?
In the long term, I would suggest reviewing the C code to move R API calls as far as possible out of parallel code; I will be happy to help with that. In the short term, expanding ALTREP seems like a fast and easy solution, if reliably possible.

@ltierney

ltierney commented May 16, 2018 via email

@WetRobot

For anyone looking for a work-around, I included this in .onAttach:

using_r_devel <- grepl(pattern = "devel", x = R.version$status)
if (using_r_devel) {
    # drop to one data.table thread on R-devel to avoid the ALTREP/GC race
    requireNamespace("data.table")
    data.table::setDTthreads(threads = 1L)
}

It passed Winbuilder.

@mattdowle
Member

mattdowle commented May 16, 2018

Markus: I shouldn't have written RAM usage; what I had in mind more specifically was cache, not RAM. My laptop has 32KB of L1d, 256KB of L2, 6MB of L3. When we use multiple threads, that same cache is shared by all the threads (and the OS and other processes). So divide those cache figures by 4 if data.table is using 4 threads. Cache isn't typically much larger on servers, so divide cache by 32 if using 32 threads on a server. If we create a new 10,000-item hop vector, that's 78KB of weight sloshing through cache, and each use of that vector will need that element's cache line (64 bytes on x86).

However, thinking about it, in data.table's case it most often wouldn't be an extra hop vector sloshing through cache alongside the base SEXP vector; rather, the parallel code would use the hop vector instead of the base SEXP one. That should be fine then, and I agree ncol*8 working bytes is not a concern even for a 1e6-column dt (8MB). It's certainly cleaner, and likely faster, to use raw C pointers inside parallel regions, so I'm very comfortable with that. Yes, any help is much appreciated. I'm looking too, so we should coordinate.

Thanks Luke - understood, and thanks again. The only place I have deliberately violated the write barrier on STRSXP and VECSXP is in reorder.c. This is a uniquely special operation: a shuffle, such as performed by setkey(). There it is absolutely guaranteed that all the members of the STRSXP are still there afterwards; they just change position. Say there are 8 STRSXP columns and 8 threads: each thread gets a column and the 8 columns are shuffled in parallel (the pointers are moved within the vector), but SET_STRING_ELT is not used. This is only ok because it's a shuffle. Otherwise, data.table carefully respects the write barrier (at least, it is intended to).
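
A simplified sketch of that shuffle (illustrative only; the real code in reorder.c differs):

#include <string.h>
#include <R.h>
#include <Rinternals.h>

/* Permute one STRSXP column in place. Only positions change: the same
   CHARSXPs remain referenced by the same vector before and after, so
   reference counts are untouched and SET_STRING_ELT is not needed.
   One column per thread; each thread writes only its own column.
   STRING_PTR is an internal accessor (hence USE_RINTERNALS). */
static void shuffle_strsxp(SEXP col, const int *order, SEXP *tmp)
{
    R_xlen_t n = XLENGTH(col);
    SEXP *p = STRING_PTR(col);
    for (R_xlen_t i = 0; i < n; i++)
        tmp[i] = p[order[i]];              /* gather in the new order */
    memcpy(p, tmp, n * sizeof(SEXP));      /* scatter back: a pure permutation */
}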

For those reading who are wondering what the 'write barrier' means (it took me a while to understand it too, with help from Luke in the past): roughly speaking, SET_STRING_ELT doesn't just assign the pointer (to the CHARSXP in the global character cache) to the vector; it also increments the reference count of the string being assigned and decrements the reference count of the string being replaced. The reference count is on the single global CHARSXP (i.e. shared between threads). So if two threads do that at the same time, the reference count can get messed up (a race), because an increment (e.g. val++ in C) is actually several operations at machine level: read, add, assign. Two threads could read at the same time and then write the same result back to the variable, causing one increment to be missed and eventually a crash elsewhere. val++ is a write op that is not thread-safe in C if val is shared between threads. Parallel C code in data.table is already aware of things like that and should be safe.

What's changed is that R API functions like INTEGER() are no longer thread-safe, where they used to be (we relied on undocumented behaviour). The other thing to be aware of is that data.table defines USE_RINTERNALS, which allows data.table to use R internals that are not part of the R API, e.g. DATAPTR. Luke and R-core would like data.table, eventually, to remove the use of USE_RINTERNALS and use only R's official C API. In the case of INTEGER() no longer being thread-safe, that would not have helped, since INTEGER is part of R's C API; rather, w.r.t. INTEGER(), we relied on undocumented thread-safety of an API function. In general, data.table does make Luke's job of changing R itself harder because of our use of USE_RINTERNALS, behind the C API.

Our goal is to remove that too, and CRAN policy may be revised in future, very reasonably, to not allow USE_RINTERNALS. It's easily quantifiable by removing the #define in data.table.h and seeing what breaks (currently, lots; e.g. 73 uses of DATAPTR). If we start now, we'll find out if there's anything that really can't be overcome, and then we can start a discussion with Luke in good time and possibly get our requests in for additions to the official C API if we can make a compelling case.

Originally, many years ago, I was motivated to use USE_RINTERNALS because it changed INTEGER from a function to a macro and was faster. That is no longer true, so we should revisit. The other reason was that DATAPTR was a convenient generic way to access a vector's contents, for code brevity. But DATAPTR can now allocate, so it's no longer just an offset from the SEXP.
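
To make the val++ point concrete, here is a standalone toy in plain C with OpenMP (nothing to do with R internals; compile with -fopenmp):

#include <stdio.h>

int main(void)
{
    long val = 0;
    /* racy: val++ is read/add/write, so concurrent increments can
       overwrite each other and the final count comes up short */
    #pragma omp parallel for
    for (long i = 0; i < 1000000; i++)
        val++;
    printf("unsynchronized: %ld (usually < 1000000)\n", val);

    val = 0;
    /* an atomic read-modify-write loses no updates */
    #pragma omp parallel for
    for (long i = 0; i < 1000000; i++) {
        #pragma omp atomic
        val++;
    }
    printf("atomic:         %ld (always 1000000)\n", val);
    return 0;
}

Without OpenMP enabled, the pragmas are ignored and both loops run sequentially, so both counts come out exact.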

@HughParsonage
Member

@mattdowle I thought only fread, fwrite, and forder were multithreaded (and only became so recently). Is that not the case? Put another way, to what extent was/is data.table reliant on INTEGER's thread-safety?

@MarkusBonsch
Contributor

@ltierney and @mattdowle Thank you so much for your explanations and material, I learned a lot.
@mattdowle I see, cache is different, but I would view it as you do: we just replace the vector of SEXPs with a vector of C pointers in the cache. I have some other issues to clarify before I can start on that, and I will get in touch with you beforehand so we don't conflict.

@mattdowle
Member

mattdowle commented May 17, 2018

Hugh: it's a bit more than that: setkey() reorders columns in parallel; subsetDT at C level is the beginning of a [.data.table replacement in C, is parallel, and is used in some situations; and fsort is parallel but not used much, as it is marked experimental. Markus was the first to use it internally in data.table: he found that coercing integer to double and then using fsort (which is currently only implemented for double) was faster and worth the coercion cost. He may have hooked fsort up to forder; I can't quite remember. Grep the source for "pragma" to find all parallel regions.

@pdbailey0
Author

pdbailey0 commented May 22, 2018

Sorry if you expected this, but I was about to test my package against this new version. First I grabbed master, built it (in 3.5.0), and installed it in 3.6.0, and I still got a substantial increase in memory use from my test case (in the OP).

74.9 MB
116 MB
219 MB
322 MB
425 MB
528 MB
631 MB
733 MB
836 MB
939 MB

and likewise from Luke's example (longer output not shown here).

@mattdowle
Member

mattdowle commented May 23, 2018

@pdbailey0 Thanks for checking. Could you please use the binary .zip that AppVeyor built and reboot your Windows machine, as per the paragraphs here? I suspect it's an upgrade/DLL issue. Also, what does test.data.table() return?

@pdbailey0
Author

OK, great, it works now, thanks! It also works great in my package.

@pdbailey0
Author

Hi, sorry, I just read your 1.11.4 NEWS and wanted to give credit where credit is due. @nguyentr17 (Trang Nguyen) came up with the minimal working example; sorry that I wasn't clearer about that above. Can you please attribute this to her and not me?

mattdowle added a commit that referenced this issue May 30, 2018
@mattdowle
Member

Sure. Hope that edit cab7c11 just now is ok. Many many thanks @nguyentr17!

@nguyentr17

Thanks for fixing it!
