Memory usage after using fwrite on a base data frame with a factor column #4571

@mvanhala

I have run into a memory usage issue after using fwrite. Boiled down: after calling fwrite on a base data.frame that contains a factor column, memory usage grows with every subsequent allocation (by about 80 MB per iteration in the example below, roughly the size of one runif(10000000) vector). The growth does not occur after calling fwrite on a base data.frame without a factor column, nor after calling fwrite on a data.table that contains a factor column.

An example with a base data.frame containing a factor column:

n <- 100000
df <- data.frame(
  x1 = runif(n),
  x2 = runif(n),
  x3 = factor(sample(state.abb, n, replace = TRUE)),
  stringsAsFactors = FALSE
)

# Before fwrite

invisible(replicate(10, { mean(runif(10000000)); print(pryr::mem_used()) }))
#> 36.6 MB
#> 40.8 MB
#> 40.8 MB
#> 40.8 MB
#> 40.8 MB
#> 40.8 MB
#> 40.8 MB
#> 40.8 MB
#> 40.8 MB
#> 40.8 MB

tmp <- tempfile(fileext = ".csv")
data.table::fwrite(df, tmp)
invisible(file.remove(tmp))

# After fwrite

invisible(replicate(10, { mean(runif(10000000)); print(pryr::mem_used()) }))
#> 124 MB
#> 204 MB
#> 284 MB
#> 364 MB
#> 444 MB
#> 524 MB
#> 604 MB
#> 684 MB
#> 764 MB
#> 844 MB

If the data.frame does not contain a factor, this issue doesn't occur.

n <- 100000
df <- data.frame(
  x1 = runif(n),
  x2 = runif(n),
  x3 = sample(state.abb, n, replace = TRUE),
  stringsAsFactors = FALSE
)

# Before fwrite

invisible(replicate(10, { mean(runif(10000000)); print(pryr::mem_used()) }))
#> 37 MB
#> 41.1 MB
#> 41.1 MB
#> 41.1 MB
#> 41.1 MB
#> 41.1 MB
#> 41.1 MB
#> 41.1 MB
#> 41.1 MB
#> 41.1 MB

tmp <- tempfile(fileext = ".csv")
data.table::fwrite(df, tmp)
invisible(file.remove(tmp))

# After fwrite

invisible(replicate(10, { mean(runif(10000000)); print(pryr::mem_used()) }))
#> 42.5 MB
#> 42.5 MB
#> 42.5 MB
#> 42.5 MB
#> 42.5 MB
#> 42.5 MB
#> 42.5 MB
#> 42.5 MB
#> 42.5 MB
#> 42.5 MB
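
For anyone reproducing this without pryr installed, the same before/after check can be sketched with base R's gc(). This is only an approximation of the loops above: mem_used_mb() and mem_trace() are hypothetical helper names, the vector size is reduced to keep the run short, and gc() forces a collection each time it is called.

```r
# Total memory currently used by R, in MB.
# Column 2 of the gc() matrix is the "used (Mb)" column for Ncells and Vcells.
mem_used_mb <- function() sum(gc()[, 2])

# Allocate a throwaway vector repeatedly and record memory after each round.
# A steady climb across iterations is the symptom described in this issue.
mem_trace <- function(iterations = 10, size = 1e6) {
  vapply(seq_len(iterations), function(i) {
    mean(runif(size))
    mem_used_mb()
  }, numeric(1))
}

trace <- mem_trace()
print(round(trace, 1))

# Growth between the first and last iteration; close to zero when nothing leaks.
print(round(trace[length(trace)] - trace[1], 1))
```

Running mem_trace() once before and once after the fwrite call should show either a flat or a climbing sequence, matching the two cases above.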

Some additional observations:

  • If the data.frame containing a factor is coerced to a data.table before invoking fwrite, i.e., data.table::fwrite(data.table::as.data.table(df), tmp), the memory usage growth does not occur.
  • When the data.frame containing a factor has fewer than 3 columns, I did not observe the memory usage growth.
  • Curiously, while trying to create a minimal example, I found that the growth depends on an R Markdown chunk option. With error = TRUE set, I did not observe the memory usage growth for the base data.frame containing a factor. With error = FALSE set, I did.
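
The first observation above suggests a possible workaround until the cause is understood: coerce to a data.table before writing. A minimal sketch, assuming the copy made by as.data.table() is acceptable (n is reduced here, and the read-back with fread is only a sanity check):

```r
library(data.table)

n <- 1000
df <- data.frame(
  x1 = runif(n),
  x2 = runif(n),
  x3 = factor(sample(state.abb, n, replace = TRUE))
)

tmp <- tempfile(fileext = ".csv")

# as.data.table() returns a new data.table, so df itself stays a plain data.frame.
fwrite(as.data.table(df), tmp)

# Read the file back to confirm that the write succeeded.
back <- fread(tmp)
invisible(file.remove(tmp))
```

This avoids the growth in my tests, but it does not explain why the plain data.frame path behaves differently.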

I got the same behavior on both Ubuntu 18.04 and Windows 10 using v1.12.8 of data.table.
