I have encountered a memory usage issue after calling fwrite. Boiled down, what I observe is that after calling fwrite on a base data.frame containing a factor column, memory usage consistently grows thereafter. The growth does not occur after calling fwrite on a base data.frame that does not contain a factor column, nor after calling fwrite on a data.table that contains a factor column.
An example with a base data.frame containing a factor column:
n <- 100000
df <- data.frame(
x1 = runif(n),
x2 = runif(n),
x3 = factor(sample(state.abb, n, replace = TRUE)),
stringsAsFactors = FALSE
)
# Before fwrite
invisible(replicate(10, { mean(runif(10000000)); print(pryr::mem_used()) }))
#> 36.6 MB
#> 40.8 MB
#> 40.8 MB
#> 40.8 MB
#> 40.8 MB
#> 40.8 MB
#> 40.8 MB
#> 40.8 MB
#> 40.8 MB
#> 40.8 MB
tmp <- tempfile(fileext = ".csv")
data.table::fwrite(df, tmp)
invisible(file.remove(tmp))
# After fwrite
invisible(replicate(10, { mean(runif(10000000)); print(pryr::mem_used()) }))
#> 124 MB
#> 204 MB
#> 284 MB
#> 364 MB
#> 444 MB
#> 524 MB
#> 604 MB
#> 684 MB
#> 764 MB
#> 844 MB
If the data.frame does not contain a factor, this issue doesn't occur.
n <- 100000
df <- data.frame(
x1 = runif(n),
x2 = runif(n),
x3 = sample(state.abb, n, replace = TRUE),
stringsAsFactors = FALSE
)
# Before fwrite
invisible(replicate(10, { mean(runif(10000000)); print(pryr::mem_used()) }))
#> 37 MB
#> 41.1 MB
#> 41.1 MB
#> 41.1 MB
#> 41.1 MB
#> 41.1 MB
#> 41.1 MB
#> 41.1 MB
#> 41.1 MB
#> 41.1 MB
tmp <- tempfile(fileext = ".csv")
data.table::fwrite(df, tmp)
invisible(file.remove(tmp))
# After fwrite
invisible(replicate(10, { mean(runif(10000000)); print(pryr::mem_used()) }))
#> 42.5 MB
#> 42.5 MB
#> 42.5 MB
#> 42.5 MB
#> 42.5 MB
#> 42.5 MB
#> 42.5 MB
#> 42.5 MB
#> 42.5 MB
#> 42.5 MB
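The growth in the two runs above could also be tracked without pryr. The following is a minimal sketch, not part of the original report: `mem_trace` is a hypothetical helper name, and it assumes that the "(Mb)" column of base R's `gc()` is an acceptable stand-in for `pryr::mem_used()`. Because `gc()` forces a collection before reporting, any growth it shows is not just garbage waiting to be collected.

```r
# Hypothetical helper (not from the original report): record memory in use,
# in MB, after each of `reps` temporary allocations, using base R only.
mem_trace <- function(reps = 10, size = 1e6) {
  vapply(seq_len(reps), function(i) {
    mean(runif(size))   # allocate a temporary vector and discard it
    sum(gc()[, 2])      # column 2 of gc() is the "(Mb)" currently in use
  }, numeric(1))
}

trace <- mem_trace(reps = 5)
growth <- diff(trace)   # change in MB between consecutive iterations
print(round(growth, 1))
```

Running `mem_trace()` once before and once after the `fwrite` call, and comparing the two `diff()` vectors, would give one number per iteration instead of a column of printed sizes.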
Some additional observations:
- If the data.frame containing a factor is coerced to a data.table before invoking fwrite (i.e., data.table::fwrite(data.table::as.data.table(df), tmp)), this memory usage growth did not occur.
- When the data.frame containing a factor contained fewer than 3 columns, I did not observe the memory usage growth.
- Curiously (discovered while trying to create a minimal example), when the code was executed in an R Markdown document with the chunk option error = TRUE set, I did not observe the memory usage growth in the example of a base data.frame containing a factor. When the chunk option error = FALSE was set, though, I did see the memory usage growth.
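Based on the first observation above, a possible stopgap is to coerce the data.frame to a data.table before writing. This is only a sketch of what was observed in this report, not a confirmed fix. The smaller `n`, the `requireNamespace` guard, and the `write.csv` fallback are assumptions added so the snippet runs even where data.table is not installed.

```r
n <- 1000
df <- data.frame(
  x1 = runif(n),
  x2 = runif(n),
  x3 = factor(sample(state.abb, n, replace = TRUE))
)

tmp <- tempfile(fileext = ".csv")
if (requireNamespace("data.table", quietly = TRUE)) {
  # Coerce first, so fwrite receives a data.table rather than a base data.frame
  data.table::fwrite(data.table::as.data.table(df), tmp)
} else {
  # Fallback for illustration only (assumption): base R writer
  utils::write.csv(df, tmp, row.names = FALSE)
}

# Read the file back to confirm the write produced the expected shape
written <- read.csv(tmp, stringsAsFactors = FALSE)
invisible(file.remove(tmp))
```

Whether the coercion avoids the growth in a given session would still need to be checked by measuring memory before and after, as in the examples above.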
I got the same behavior on both Ubuntu 18.04 and Windows 10 using v1.12.8 of data.table.