Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature Request: Make tidyr functions play nicely with data.table #171

Closed
mindymallory opened this issue Mar 4, 2016 · 3 comments
Closed

Comments

@mindymallory
Copy link

The tidyr functions don't seem to work with data.table functionality. For example, I would like to fill missing values of a large data.table in the following way:

#install.packages("data.table")
#install.packages("tidyr")
library(data.table)
library(tidyr)
dt <- data.table(
  year = c(2015, NA, NA, NA), 
  trt = c("A", NA, "B", NA)
)

dt[, year := fill(year)]
dt[, c("year", "trt") := .(fill(year), fill(trt))]

but the code above throws the following error:

# > dt[, year := fill(year)]
# Error in UseMethod("fill_") : 
#  no applicable method for 'fill_' applied to an object of class "c('double', 'numeric')"
# > dt[, c("year", "trt") := .(fill(year), fill(trt))]
# Error in UseMethod("fill_") : 
#  no applicable method for 'fill_' applied to an object of class "c('double', 'numeric')"

Since sequentially applying fill() to each column does not modify in place, the method below is costly for large data.tables.

rm(dt)
dt <- data.table(
  year = c(2015, NA, NA, NA), 
  trt = c("A", NA, "B", NA)
)
tracemem(dt)
dt <- fill(dt, year)
dt <- fill(dt, trt)

# > rm(dt)
# > dt <- data.table(
# +     year = c(2015, NA, NA, NA), 
# +     trt = c("A", NA, "B", NA)
# + )
# > tracemem(dt)
# [1] "<00000000088F6118>"
# > dt <- fill(dt, year)
# tracemem[0x00000000088f6118 -> 0x00000000092d8c50]: fill_.data.frame fill_ fill 
# tracemem[0x00000000092d8c50 -> 0x00000000092d8b00]: [[<-.data.frame [[<- fill_.data.frame  fill_ fill 
# tracemem[0x00000000092d8b00 -> 0x00000000092d8ac8]: [[<-.data.frame [[<- fill_.data.frame  fill_ fill 
# > dt <- fill(dt, trt)
# tracemem[0x00000000092d8ac8 -> 0x000000000951f1c0]: fill_.data.frame fill_ fill 
# tracemem[0x000000000951f1c0 -> 0x000000000951f0a8]: [[<-.data.frame [[<- fill_.data.frame fill_  fill 
# tracemem[0x000000000951f0a8 -> 0x000000000951f070]: [[<-.data.frame [[<- fill_.data.frame fill_   fill 
# > 

Thanks for taking a look!

@hadley
Copy link
Member

hadley commented May 16, 2016

I'd be willing to review pull requests to do this, but I don't have the time (or knowledge of data.table) to do it myself.

@hadley hadley closed this as completed May 16, 2016
@cderv
Copy link
Contributor

cderv commented May 22, 2016

@mindymallory, What you describe is not an issue with tidyr function or new feature to implement. It is rather a matter of how you use tidyr functions with data.table.

There are few error in the tests you've done with dt[, year := fill(year)]

  • First, fill takes 2 arguments with first a data (data.frame, data.table or tbl) then a bare column name. In your test, you specify the column name as first argument so fill won't work independently of data.table.
  • Then, fill does not return the column you specify but the all data you applied fill on. It is why you can't replace year with the result of the fill function in dt[, year := fill(year)]. With LHS:=RHS, RHS must be a vector of replacement values, which is not the case in year := fill(year).

So the code you have tried have these problems independently of tidyr compatibility with data.table.

Here is some ways to use fill inside data.table. It could help other users too.

library(data.table)
library(tidyr)
dt <- data.table(
  year = c(2015, NA, NA, NA), 
  trt = c("A", NA, "B", NA)
)
tracemem(dt)
#> [1] "<0000000019B796C8>"

If you want to use fill, you must specify first a data then a column. Inside a data.table, use .SD for that, and see that fill return all the columns of dt

dt[, fill(.SD, year)]
#>    year trt
#> 1: 2015   A
#> 2: 2015  NA
#> 3: 2015   B
#> 4: 2015  NA

You can then select a column with $ for example, and it returns a vector.

dt[, fill(.SD, year)$year]
#> [1] 2015 2015 2015 2015

If you give several columns to fill, it returns these columns filled.

dt[, fill(.SD, year, trt)]
#>    year trt
#> 1: 2015   A
#> 2: 2015   A
#> 3: 2015   B
#> 4: 2015   B

So, how apply fill to replace columns inside a data.table.

You could extract the vector you want and use :=. And do it again for the other colum.

dt[, year := fill(.SD, year)$year]
dt
#>    year trt
#> 1: 2015   A
#> 2: 2015  NA
#> 3: 2015   B
#> 4: 2015  NA
dt[, trt := fill(.SD, trt)$trt]
dt
#>    year trt
#> 1: 2015   A
#> 2: 2015   A
#> 3: 2015   B
#> 4: 2015   B

you could also apply the change at once using data.table syntax.

dt <- data.table(
  year = c(2015, NA, NA, NA), 
  trt = c("A", NA, "B", NA)
)
tracemem(dt)
#> [1] "<0000000006104428>"
dt[, c("year", "trt") := .(fill(.SD, year)$year, fill(.SD, trt)$trt)]
dt
#>    year trt
#> 1: 2015   A
#> 2: 2015   A
#> 3: 2015   B
#> 4: 2015   B

However, as fill return all the columns of the data argument, you can do without extracting column with $

dt <- data.table(
  year = c(2015, NA, NA, NA), 
  trt = c("A", NA, "B", NA)
)
tracemem(dt)
#> [1] "<0000000005B57B98>"
dt[, c("year", "trt") := fill(.SD, year, trt)]
dt
#>    year trt
#> 1: 2015   A
#> 2: 2015   A
#> 3: 2015   B
#> 4: 2015   B

Finally, you could use an approach more suited for programming.

dt <- data.table(
  year = c(2015, NA, NA, NA), 
  trt = c("A", NA, "B", NA)
)
tracemem(dt)
#> [1] "<00000000187CF230>"
cols <- c("year", "trt")
dt
#>    year trt
#> 1: 2015   A
#> 2:   NA  NA
#> 3:   NA   B
#> 4:   NA  NA
dt[, (cols) := fill_(.SD, cols), .SDcols = cols]
dt
#>    year trt
#> 1: 2015   A
#> 2: 2015   A
#> 3: 2015   B
#> 4: 2015   B

Note that using this way, you could apply fill easily on a subset of columns

dt <- data.table(
  year = c(2015, NA, NA, NA), 
  trt = c("A", NA, "B", NA),
  trt2 = c(NA, "C", NA, "D")
)
tracemem(dt)
#> [1] "<00000000049EFD30>"
cols <- c("year", "trt")
dt
#>    year trt trt2
#> 1: 2015   A   NA
#> 2:   NA  NA    C
#> 3:   NA   B   NA
#> 4:   NA  NA    D
dt[, (cols) := fill_(.SD, cols), .SDcols = cols]
dt
#>    year trt trt2
#> 1: 2015   A   NA
#> 2: 2015   A    C
#> 3: 2015   B   NA
#> 4: 2015   B    D

Hope this answer your question and the feature request you want. Note that I use tracemem to show that there are no copies and that tidyr is compatible with data.table concept, as data.table syntax and concept are used correctly.

@hadley, I think you won't have any PR to review on this subject.

@mindymallory
Copy link
Author

@cderv Thanks for this detailed explanation! You've saved me a lot of memory usage!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants