
memory leak #2866

Closed
pdbailey0 opened this issue May 11, 2018 · 36 comments

@pdbailey0

pdbailey0 commented May 11, 2018

I found what appears to be a memory leak in data.table, but only when using the current 3.6.0 nightly build of R.

# Minimal reproducible example

require(data.table)
n <- 2e6
df <- data.frame(a=rnorm(n),
                 b=factor(rbinom(n,5,prob=0.5),1:5,letters[1:5]),
                 c=factor(rbinom(n,5,prob=0.5),1:5,letters[1:5]))
dt <- setDT(df)
print(pryr::mem_used())
fff <- function(aref) {
  ff <- lapply(1:5, function(i) {
    dt2 <- dt[,list(sumA=sum(get(aref))),by=b][,c:=letters[i]]
    dt2
  })
  return(rbindlist(ff))
}
for(i in 1:10) {
  f <- fff("a")
  rm("f")
  gc()
  print(pryr::mem_used())
}

I would expect the printed memory usage to stay flat (since f is removed and gc() is called), but it goes up. Notice that the growth always takes a few iterations to appear; adding [,c:=letters[i]] triggers it sooner but is not necessary (without it, you may have to increase the loop count from 10 to 200 or so).

example output

66.6 MB
66.6 MB
66.6 MB
66.6 MB
87.3 MB
190 MB

# Output of sessionInfo()

R Under development (unstable) (2018-05-10 r74708)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.11.3

loaded via a namespace (and not attached):
[1] compiler_3.6.0   pryr_0.1.4       magrittr_1.5     tools_3.6.0     
[5] Rcpp_0.12.16     stringi_1.1.7    codetools_0.2-15 stringr_1.3.0  

Also posted on Stack Overflow.

This is a problem with both the CRAN version and the GitHub version of data.table.

@mattdowle mattdowle added this to the 1.11.4 milestone May 11, 2018
@pdbailey0
Author

pdbailey0 commented May 11, 2018

I'll add that the call to get() appears to be necessary (sum(get(aref))), but it need not be inside a function. So this works:

# NOTE: sum(get(aref)) now sum(a)
require(data.table)
n <- 2e6
df <- data.frame(a=rnorm(n),
                 b=factor(rbinom(n,5,prob=0.5),1:5,letters[1:5]),
                 c=factor(rbinom(n,5,prob=0.5),1:5,letters[1:5]))
dt <- setDT(df)
print(pryr::mem_used())
aref <- "a"
for(i in 1:10) {
  ff <- lapply(1:5, function(i) {
    dt2 <- dt[,list(sumA=sum(a)),by=b][,c:=letters[i]]
    dt2
  })
  f <- rbindlist(ff)
  rm("f")
  gc()
  print(pryr::mem_used())
}
# no memory leak
#66.6 MB (repeated)

but this causes the memory leak

# NOTE: no fff function
require(data.table)
n <- 2e6
df <- data.frame(a=rnorm(n),
                 b=factor(rbinom(n,5,prob=0.5),1:5,letters[1:5]),
                 c=factor(rbinom(n,5,prob=0.5),1:5,letters[1:5]))
dt <- setDT(df)
print(pryr::mem_used())
aref <- "a"
for(i in 1:10) {
  ff <- lapply(1:5, function(i) {
    dt2 <- dt[,list(sumA=sum(get(aref))),by=b][,c:=letters[i]]
    dt2
  })
  f <- rbindlist(ff)
  rm("f")
  gc()
  print(pryr::mem_used())
}
# 66.6 MB
# 170 MB
# 273 MB
# 375 MB

Interestingly, running the first code chunk after the second shows the leak there too; the first chunk just can't get it started on its own.

@nguyentr17

Using .SDcols also causes a memory leak in R 3.6.0.

sessionInfo

require(data.table)
sessionInfo()
# R Under development (unstable) (2018-05-10 r74708)
# Platform: x86_64-w64-mingw32/x64 (64-bit)
# Running under: Windows 7 x64 (build 7601) Service Pack 1
# 
# Matrix products: default
# 
# locale:
#   [1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
# [3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
# [5] LC_TIME=English_United States.1252    
# 
# attached base packages:
#   [1] stats     graphics  grDevices utils     datasets  methods   base     
# 
# other attached packages:
#   [1] data.table_1.11.2
# 
# loaded via a namespace (and not attached):
#   [1] compiler_3.6.0 tools_3.6.0 

Example:

The memory leak only occurs when .SDcols has length greater than 1.

n <- 2e6
df <- data.frame(a=rnorm(n),
                 b=factor(rbinom(n,5,prob=0.5),1:5,letters[1:5]),
                 c=factor(rbinom(n,5,prob=0.5),1:5,letters[1:5]))
setDT(df)

fff2 <- function(colList) {
  ff <- lapply(1:5, function(i) {
    df[,lapply(.SD,sum),by=b,.SDcols = colList]
  })
  return(NULL)
}
gc()
print(pryr::mem_used())
# .SDcols of length 1 does not have memory leak
for(i in 1:10) {
  f <- fff2(c("a"))
  gc()
  print(pryr::mem_used())
}
# 93 MB
# 93.1 MB
# 93.1 MB
# 93.1 MB
# 93.1 MB
# 93.1 MB
# 93.1 MB
# 93.1 MB
# 93.1 MB
# 93.1 MB

# .SDcols of length > 1 has memory leak
for(i in 1:10) {
  f <- fff2(c("a","a"))
  gc()
  print(pryr::mem_used())
}
# 93.1 MB
# 93.1 MB
# 148 MB
# 291 MB
# 433 MB
# 576 MB
# 719 MB
# 862 MB
# 1 GB
# 1.15 GB

@mattdowle
Member

mattdowle commented May 11, 2018

I don't see the leak with these examples on Linux with the latest r-devel (well, as of 2nd May, and we'd been seeing this problem before then). Looks likely to be Windows-only then, consistent with #2767. That will mean debugging on Windows, which I've never done before. I'm surprised a leak like this would be Windows-only, as it relates more to R's internal heap, which is platform-independent as far as I know.

@pdbailey0
Author

We first saw this a few days ago, if that helps.

@mattdowle
Member

mattdowle commented May 11, 2018

@pdbailey0 Yes, that helps. Have you been routinely using the R-devel daily snapshot every day for more than a few weeks, and only saw the problem a few days ago?

@pdbailey0
Author

@nguyentr17 did not see this problem with the 3.5.0 pre-release, and I did with the 3.6.0 pre-release; do we know when they switched over? It occurs to me that I don't know how often she updated her R-devel.

@mattdowle
Member

Ok thanks. I updated R-devel on my laptop to 2018-05-10 (r74708) but still no leak on Linux. I used your tests verbatim and increased the loops to 100 too. We'll just have to bite the bullet and use Windows to trace it...

@HughParsonage
Member

Can reproduce on Windows R-devel.

@pdbailey0
Author

When I tried to submit my package, the Windows check had an error that I believe indicates data.table is currently not available on CRAN's Windows install.

** byte-compile and prepare package for lazy loading
Warning: S3 methods 'all.equal.data.table', 'groupingsets.data.table', 'cube.data.table', 'rollup.data.table', '[.data.table', '[<-.data.table', '$<-.data.table', 'print.data.table', 'as.data.table.data.table', 'as.data.table.data.frame', 'as.data.table.array', 'as.data.table.matrix', 'as.data.table.list', 'as.data.table.integer', 'as.data.table.numeric', 'as.data.table.character', 'as.data.table.logical', 'as.data.table.factor', 'as.data.table.ordered', 'as.data.table.Date', 'as.data.table.ITime', 'as.data.table.table', 'as.data.table.xts', 'as.data.table.default', 'as.data.frame.data.table', 'as.list.data.table', 'as.matrix.data.table', 'split.data.table', 'dim.data.table', 'dimnames.data.table', 'dimnames<-.data.table', 'names<-.data.table', 'duplicated.data.table', 'unique.data.table', 'merge.data.table', 'subset.data.table', 'transform.data.table', 'within.data.table', 'is.na.data.table', 'format.data.table', 'Ops.data.table', 'anyDuplicated.data.table', 'melt.data.table', 'dcast.data.tabl [... truncated]
Error in library.dynam(lib, package, package.lib) : 
  DLL 'datatable' not found: maybe not installed for this architecture?

@HughParsonage
Member

I was able to update a package successfully on 2018-05-13. I suspect you encountered a race condition: try resubmitting.

@pdbailey0
Author

@HughParsonage, you were right: I resubmitted and the DLL was present. Unfortunately, the memory leak led to CRAN running out of memory and my failing a test. I posted on R-package-devel asking for help with the immediate issue of my package submission.

@HughParsonage
Member

HughParsonage commented May 15, 2018

Thanks @pdbailey0: could you provide a link to the package you're attempting to submit? I couldn't see it on your GitHub profile.

To clarify, I don't think this is a problem with your package.

@ltierney

Hint: Running with OMP_NUM_THREADS=1 makes the memory problem go away.

Calling DATAPTR for ALTREP objects temporarily disables the GC. Calling DATAPTR from multiple threads without synchronization creates a race condition that can leave the GC disabled, so memory use grows. It's not a leak: signaling a user interrupt, or anything that triggers a jump to top level, re-enables the GC.
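
In sketch form (illustrative C only, not data.table's actual code), the unsafe and the safe pattern look like this:

#include <R.h>
#include <Rinternals.h>

/* BAD: REAL(x) goes through DATAPTR for ALTREP objects, which can
   allocate and temporarily disables the GC; doing this from several
   threads at once races on that flag and can leave the GC off. */
double unsafe_sum(SEXP x)
{
    double s = 0.0;
    R_xlen_t n = XLENGTH(x);
    #pragma omp parallel for reduction(+:s)
    for (R_xlen_t i = 0; i < n; i++)
        s += REAL(x)[i];              /* R API call on worker threads */
    return s;
}

/* OK: resolve the pointer once, single-threaded (materialising the
   ALTREP there if needed); the threads then touch only raw C memory. */
double safe_sum(SEXP x)
{
    double s = 0.0;
    R_xlen_t n = XLENGTH(x);
    const double *xp = REAL(x);
    #pragma omp parallel for reduction(+:s)
    for (R_xlen_t i = 0; i < n; i++)
        s += xp[i];
    return s;
}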

@jangorecki
Member

@ltierney why is it not consistent across OSes?

@ltierney

@jangorecki it's a race condition. The risk exists on all platforms. When it bites varies from run to run within a platform. How likely it is to bite varies between platforms with performance differences, in particular in thread implementations. On my Ubuntu system, where getDTthreads() returns 4, this variant shows the problem on most runs:

require(data.table)
n <- 20
df <- data.frame(a=rnorm(n),
                 b=factor(rbinom(n,5,prob=0.5),1:5,letters[1:5]),
                 c=factor(rbinom(n,5,prob=0.5),1:5,letters[1:5]))
dt <- setDT(df)
v <- gc()
print(sum(v[, 2]))
fff <- function(aref) {
  ff <- lapply(1:5, function(i) {
    dt2 <- dt[,list(sumA=sum(get(aref))),by=b][,c:=letters[i]]
    dt2
  })
  return(rbindlist(ff))
}
for(i in 1:100) {
  f <- fff("a")
  rm("f")
  v <- gc()
  print(sum(v[, 2]))
}

@mattdowle
Member

Many thanks @ltierney! I see DATAPTR usage within a parallel region in reorder.c. I'll start there ...

@pdbailey0
Author

I can confirm that using setDTthreads(1) fixes this for me and setting it back to 2+ makes the code show the memory inflation. I couldn't make the problem show up in 3.5.0.

@ltierney

ltierney commented May 15, 2018 via email

@mattdowle
Member

That commit just now fixes the reprex for me locally. This one was indeed INTEGER() inside a parallel region on an ALTREP; in subset.c, not reorder.c, as it turned out. For now, it just calls duplicate() on any ALTREP it is passed. I see that duplicate(x) calls duplicate1(x, deep=TRUE), so I'm guessing I can rely on that to expand an ALTREP (since duplicate1 is hidden from package use).

If this works I'll sweep through all parallel regions similarly.
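
A hedged sketch of the shape of that fix (expand_if_altrep is an illustrative name, not the function in the commit):

#include <Rinternals.h>

/* Expand any ALTREP input up front, single-threaded, so the parallel
   code that follows only ever sees ordinary vectors. Assumes, per the
   comment above, that duplicate() reliably expands an ALTREP. */
static SEXP expand_if_altrep(SEXP x, int *nprotect)
{
    if (ALTREP(x)) {
        x = PROTECT(duplicate(x));
        (*nprotect)++;
    }
    return x;
}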

I saw you reduced n to 20 to reproduce this on Ubuntu; n was 2e6 in the Windows reprex as originally reported. Did you make n smaller so that the vectors would be on the R heap? I think I recall you telling me before that Linux R now allocates large vectors on the OS heap but Windows R still puts them on the R heap. If so, would that be a contributing factor in the apparent Windows-only nature of this particular reprex at the 2e6 size?

@ltierney

ltierney commented May 15, 2018 via email

@mattdowle
Member

mattdowle commented May 15, 2018

Huge thanks Luke! We would have been stuck for weeks (perhaps longer) if it wasn't for you.

This one fails for me too, but gracefully, with "DT contains an altrep": I added that error in the last commit. Will address now. [I previously wrote that the error() call was inside a parallel region, which I know to be unsafe, but actually it isn't, so it should be a reliably graceful error, at least in this branch as of now.]

It seems that the test suite is passing AppVeyor R-devel now, for the first time in a month or so!!! Time will tell. I'll tidy up subset.c and merge it to fix this particular reprex so that folks can start testing 1.11.3. I'll continue to sweep the other parallel regions.

@ltierney

ltierney commented May 16, 2018 via email

@mattdowle
Member

mattdowle commented May 16, 2018

The trouble with moving them out is that I'll then often have to allocate a new vector to hold the vector of safe pointers (what INTEGER and REAL return). For data.tables of 10,000 or more columns there's a RAM usage concern, but it's also about the extra code required. It might be faster to do that though, I suppose, since IIUC INTEGER and REAL are slightly heavier now.
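
For concreteness, a minimal sketch of that pointer-vector approach, assuming a VECSXP of all-integer columns (process_int_columns is an illustrative name):

#include <R.h>
#include <Rinternals.h>

/* Gather the "safe pointers" once on the main thread, then hand only
   raw C pointers to the parallel region: ncol * 8 bytes of working
   memory, as discussed below. */
void process_int_columns(SEXP dt)
{
    R_xlen_t ncol = XLENGTH(dt);
    const int **col = (const int **) R_alloc(ncol, sizeof(int *));
    for (R_xlen_t j = 0; j < ncol; j++)
        col[j] = INTEGER(VECTOR_ELT(dt, j));  /* may materialise an ALTREP, safely */
    #pragma omp parallel for
    for (R_xlen_t j = 0; j < ncol; j++) {
        /* per-column work via col[j][...] only;
           no INTEGER()/REAL() calls from worker threads */
    }
}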

As an option for package use, is there a reliable way to expand an ALTREP vector?

@MarkusBonsch
Contributor

I don't think that the memory allocation for a vector of pointers is an issue. A pointer is 64 bits at most, so 1e6 columns would require ~8 MB if I am not mistaken. If the data.table contained all integer columns, that would correspond to the memory needed by two additional rows (assuming 32-bit integers). I think this is definitely acceptable.

What concerns me more is CHARSXPs and VECSXPs. Is it even possible to get rid of SET_STRING_ELT and SET_VECTOR_ELT and replace them with native C code? But then I read that the ELT methods do

Set_elt - Set the value of element i without using dataptr

(https://gmbecker.github.io/jsm2017/jsm2017.html).

So this will probably not be a problem, since they explicitly don't need DATAPTR?
In the long term, I would suggest reviewing the C code to move R API calls as far as possible out of parallel code; I will be happy to help with that. In the short term, expanding ALTREP seems like a fast and easy solution, if reliably possible.

@ltierney

ltierney commented May 16, 2018 via email

@WetRobot

For anyone looking for a work-around, I included this in .onAttach:

using_r_devel <- grepl(pattern = "devel", x = R.version$status)
if (using_r_devel) {
    # drop to one data.table thread on R-devel to avoid the ALTREP/GC race
    requireNamespace("data.table")
    data.table::setDTthreads(threads = 1L)
}

It passed Winbuilder.

@mattdowle
Member

mattdowle commented May 16, 2018

Markus: I shouldn't have written RAM usage; what I had in mind more specifically was cache, not RAM. My laptop has 32KB of L1d, 256KB of L2, 6MB of L3. When we use multiple threads, that same cache is shared by all the threads (and the OS and other processes). So divide those cache figures by 4 if data.table is using 4 threads. Cache isn't typically much larger on servers, so divide cache by 32 if using 32 threads on a server. If we create a new 10,000-item hop vector, that's 78KB of weight sloshing through cache, and each use of that vector will need that element's cache line (64 bytes on x86).

However, thinking about it, in data.table's case it most often wouldn't be an extra hop vector sloshing through cache alongside the base SEXP vector; rather, the parallel code would use the hop vector instead of the base SEXP one. That should be fine then, and I agree ncol*8 working bytes is not a concern even for a 1e6-column dt (8MB). It's certainly cleaner, and likely faster, to use raw C pointers inside parallel regions, so I'm very comfortable with that. Yes, any help is much appreciated. I'm looking too, so we should coordinate.

Thanks Luke - understood, and thanks again. The only place I have deliberately violated the write barrier on STRSXP and VECSXP is in reorder.c. This is a uniquely special operation: a shuffle, such as performed by setkey(). There it is absolutely guaranteed that all the members of the STRSXP are still there afterwards; they just change position. Say there are 8 STRSXP columns and 8 threads: each thread gets a column and the 8 columns are shuffled in parallel (the pointers are moved within the vector), but SET_STRING_ELT is not used. This is only ok because it's a shuffle. Otherwise, data.table carefully respects the write barrier (at least, it is intended to).
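
A simplified sketch of that shuffle (illustrative only; the real code in reorder.c differs):

#include <string.h>
#include <R.h>
#include <Rinternals.h>

/* Permute one STRSXP column in place. Only positions change: the same
   CHARSXPs remain referenced by the same vector before and after, so
   reference counts are untouched and SET_STRING_ELT is not needed.
   One column per thread; each thread writes only its own column.
   STRING_PTR is an internal accessor (hence USE_RINTERNALS). */
static void shuffle_strsxp(SEXP col, const int *order, SEXP *tmp)
{
    R_xlen_t n = XLENGTH(col);
    SEXP *p = STRING_PTR(col);
    for (R_xlen_t i = 0; i < n; i++)
        tmp[i] = p[order[i]];              /* gather in the new order */
    memcpy(p, tmp, n * sizeof(SEXP));      /* scatter back: a pure permutation */
}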

For those reading who are wondering what the 'write barrier' means (it took me a while to understand it too, with help from Luke in the past): roughly speaking, SET_STRING_ELT doesn't just assign the pointer (to the CHARSXP in the global character cache) to the vector; it also increments the reference count of the string being assigned and decrements the reference count of the string being replaced. The reference count is on the single global CHARSXP (i.e. shared between threads). So if two threads do that at the same time, the reference count can get messed up (a race), because an increment (e.g. val++ in C) is actually several operations at machine level: read, add, assign. Two threads could read at the same time and then write the same result back to the variable, causing one increment to be missed and eventually a crash elsewhere. val++ is a write op that is not thread-safe in C if val is shared between threads. Parallel C code in data.table is already aware of things like that and should be safe.

What's changed is that R API functions like INTEGER() are no longer thread-safe, where they used to be (we relied on undocumented behaviour). The other thing to be aware of is that data.table defines USE_RINTERNALS, which allows data.table to use R internals that are not part of the R API, e.g. DATAPTR. Luke and R-core would like data.table, eventually, to remove the use of USE_RINTERNALS and use only R's official C API. In the case of INTEGER() no longer being thread-safe, that would not have helped, since INTEGER is part of R's C API; rather, w.r.t. INTEGER(), we relied on undocumented thread-safety of an API function. In general, data.table does make Luke's job of changing R itself harder because of our use of USE_RINTERNALS, behind the C API.

Our goal is to remove that too, and CRAN policy may be revised in future, very reasonably, to not allow USE_RINTERNALS. It's easily quantifiable by removing the #define in data.table.h and seeing what breaks (currently, lots; e.g. 73 uses of DATAPTR). If we start now, we'll find out if there's anything that really can't be overcome, and then we can start a discussion with Luke in good time and possibly get our requests in for additions to the official C API if we can make a compelling case.

Originally, many years ago, I was motivated to use USE_RINTERNALS because it changed INTEGER from a function to a macro and was faster. That is no longer true, so we should revisit. The other reason was that DATAPTR was a convenient generic way to access a vector's contents, for code brevity. But DATAPTR can now allocate, so it's no longer just an offset from the SEXP.
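
To make the val++ point concrete, here is a standalone toy in plain C with OpenMP (nothing to do with R internals; compile with -fopenmp):

#include <stdio.h>

int main(void)
{
    long val = 0;
    /* racy: val++ is read/add/write, so concurrent increments can
       overwrite each other and the final count comes up short */
    #pragma omp parallel for
    for (long i = 0; i < 1000000; i++)
        val++;
    printf("unsynchronized: %ld (usually < 1000000)\n", val);

    val = 0;
    /* an atomic read-modify-write loses no updates */
    #pragma omp parallel for
    for (long i = 0; i < 1000000; i++) {
        #pragma omp atomic
        val++;
    }
    printf("atomic:         %ld (always 1000000)\n", val);
    return 0;
}

Without OpenMP enabled, the pragmas are ignored and both loops run sequentially, so both counts come out exact.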

@HughParsonage
Member

@mattdowle I thought only fread, fwrite, and forder were multithreaded (and only became so recently). Is that not the case? Put another way, to what extent was/is data.table reliant on INTEGER's thread-safety?

@MarkusBonsch
Contributor

@ltierney and @mattdowle Thank you so much for your explanations and material, I learned a lot.
@mattdowle I see, cache is different, but I would view it as you do: we just replace the vector of SEXPs with a vector of C pointers in the cache. I have some other issues to clarify before I can start on that, and I will get in touch with you beforehand so we don't conflict.

@mattdowle
Member

mattdowle commented May 17, 2018

Hugh: it's a bit more than that: setkey() reorders columns in parallel; subsetDT at C level is the beginning of a [.data.table replacement in C, is parallel, and is used in some situations; and fsort is parallel but not used much, as it is marked experimental. Markus was the first to use it internally in data.table: he found that coercing integer to double and then using fsort (which is currently only implemented for double) was faster and worth the coercion cost. He may have hooked fsort up to forder; I can't quite remember. Grep the source for "pragma" to find all parallel regions.

@pdbailey0
Author

pdbailey0 commented May 22, 2018

Sorry if you expected this, but I was about to test my package against this new version. First I grabbed master, built it (in 3.5.0), and installed it in 3.6.0, and I still got a substantial increase in memory use from my test case (in the OP).

74.9 MB
116 MB
219 MB
322 MB
425 MB
528 MB
631 MB
733 MB
836 MB
939 MB

and likewise from Luke's example (longer output not shown here).

@mattdowle
Member

mattdowle commented May 23, 2018

@pdbailey0 Thanks for checking. Could you please use the binary .zip that AppVeyor built and reboot your Windows machine, as per the paragraphs here? I suspect it's an upgrade/DLL issue. Also, what does test.data.table() return?

@pdbailey0
Author

OK, great, it works now, thanks! It also works great in my package.

@pdbailey0
Author

Hi, sorry, I just read your 1.11.4 NEWS and wanted to give credit where credit is due. @nguyentr17 (Trang Nguyen) came up with the minimal working example; sorry that I wasn't clearer about that above. Can you please attribute this to her and not me?

mattdowle added a commit that referenced this issue May 30, 2018
@mattdowle
Member

Sure. Hope that edit cab7c11 just now is ok. Many many thanks @nguyentr17!

@nguyentr17

Thanks for fixing it!
