dim() on 0-column data.table produced in non-data.table-aware package is wrong #2422
Actually,
In fact,
I don't know what the result of |
The bug is not whether or not the "actual" dimension is The behaviour of > iris[character(0)]
data frame with 0 columns and 150 rows has very desirable properties, for example one has
Even though |
I understand your point, but still I prefer |
That is just a detail about implementation, though. Having zero column tables with nonzero rows is similar to having zero-length vectors that still have a type. I changed the confusing |
My understanding is that having a zero-row table with nonzero columns is similar to having zero-length vectors that still have a type. Also, my comment from the linked issue:
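The analogy discussed in the comments above can be checked in base R alone. The following is only a minimal sketch using the built-in `iris` data set; the variable names `zero_rows` and `zero_cols` are made up for illustration, and nothing here touches data.table.

```r
# Zero-length vectors still carry a type.
typeof(integer(0))
typeof(character(0))

# A zero-row data.frame keeps its columns and their classes.
zero_rows <- iris[integer(0), ]
dim(zero_rows)            # rows first, then columns
class(zero_rows$Species)

# A zero-column data.frame keeps its row count.
zero_cols <- iris[, FALSE]
dim(zero_cols)
```

In base R both degenerate cases keep the dimension that was not subsetted away, which is the symmetry some commenters here would like data.table to share.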
A data.frame is a rectangular list of vectors of equal length. Every data.frame knows its own height, and we cannot add a new column with a different length. This means that we cannot add a nonzero-length vector to an existing zero-row data.frame:
empty_df = data.frame()
empty_df$newcol = seq_len(nrow(iris)) # Error! Height is incompatible
#> Error in `$<-.data.frame`(`*tmp*`, newcol, value = 1:150): replacement has 150 rows, data has 0
df = iris[,FALSE]
print(df)
#> data frame with 0 columns and 150 rows
attributes(df) # row.names are preserved
#> $names
#> character(0)
#>
#> $class
#> [1] "data.frame"
#>
#> $row.names
#> [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
#> [18] 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34
#> [35] 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51
#> [52] 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68
#> [69] 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85
#> [86] 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102
#> [103] 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119
#> [120] 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136
#> [137] 137 138 139 140 141 142 143 144 145 146 147 148 149 150
df$newcol = seq_len(nrow(iris)) # OK, df is derived from iris
dt = data.table::data.table(iris)[,FALSE]
print(dt)
#> Null data.table (0 rows and 0 cols)
attributes(dt) # row.names are discarded
#> $class
#> [1] "data.table" "data.frame"
#>
#> $row.names
#> integer(0)
#>
#> $names
#> character(0)
#>
#> $.internal.selfref
#> <pointer: 0x7f943b01c2e0>
dt$newcol = seq_len(nrow(iris)) # Error! even if dt is derived from iris
#> Error in `[<-.data.table`(x, j = name, value = value): Cannot use := to add columns to a null data.table (no columns), currently. You can use := to add (empty) columns to a 0-row data.table (1 or more empty columns), though.
Created on 2019-02-06 by the reprex package (v0.2.1)
Keeping |
In other words,
tbl = tibble::as_tibble(iris)
dt = data.table::data.table(iris)
ncol(tbl[seq_len(3L),])
#> [1] 5
ncol(tbl[integer(0L),])
#> [1] 5
ncol(dt[seq_len(3L),])
#> [1] 5
ncol(dt[integer(0L),])
#> [1] 5
nrow(tbl[,"Species"])
#> [1] 150
nrow(tbl[,FALSE])
#> [1] 150
nrow(dt[,"Species"])
#> [1] 150
nrow(dt[,FALSE]) # Surprise!
#> [1] 0
Created on 2019-02-06 by the reprex package (v0.2.1)
It would also be useful to show the practical implications of both approaches. For example, if some package breaks, please provide a reproducible example.
OK, let's forget about |
I strongly disagree. A data.frame (and hence also a data.table) is a two-dimensional data structure. That is also why Anyway, there are two issues with the current behavior of data.table:
|
A data.frame is a two-dimensional data structure, but it is not a special case of multidimensional data. Because of that, we can store different data types in different columns. This is not possible for multidimensional data, where a column is no different from a row, a page, or any other name you use instead of the integer sequence that maps data onto dimensions. Names like
This consistency is exactly what you would expect from multidimensional data, where dimension 1 (let's call it "row") is no different from dimension 2 (let's call it "column"). In data.frames, by contrast, a row is a child of a column. I am not saying we have to strictly align to the above; we have already made multiple exceptions just for the sake of being consistent with base R.
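To make the "list of columns" point concrete, here is a small base R sketch (variable names `m` and `df` are only illustrative). It contrasts a matrix, which stores its shape in a `dim` attribute, with a data.frame, which is a list of columns and records its height in the `row.names` attribute. It does not show how data.table stores its dimensions.

```r
# A matrix is one vector plus a "dim" attribute, so a 150 x 0 shape is stored explicitly.
m <- matrix(numeric(0), nrow = 150, ncol = 0)
attr(m, "dim")

# A data.frame is a list of columns; each column may have its own type.
is.list(iris)
sapply(iris, class)

# With no columns left, the only record of the height is the row.names attribute.
df <- iris[, FALSE]
length(df)                     # number of columns
length(attr(df, "row.names"))  # number of rows remembered by the data.frame
nrow(df)
```

So in base R the row count of a zero-column data.frame survives only because `row.names` is kept, which matches the `attributes(df)` output shown earlier in this thread.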
related discussion: https://stat.ethz.ch/pipermail/r-devel/2019-May/077796.html |
I just ran into this issue and while I understand (and agree there is some merit to) @jangorecki's "nestedness" argument, I still feel the current data.table behavior is counter-intuitive. Out of the big three data frame structures in R (the other two being |
Related discussion on twitter. I like the idea of returning an object with the row dimensions, analogous to returning an object with the col dimensions in |
I think I fail to see instances where the current behaviour can be considered a feature, not a bug. I created a wrapper class around data.tables that pre-allocates rows (until #660 gets resolved) and not having zero-column data.tables with non-zero rows decreases performance: All pre-allocated rows vanish once the last column in the data.table is removed, and pre-allocation has to happen again once new columns are added. It also complicates code as I have to delay the creation of the underlying data.table until users have provided at least one column. I really wish it would be different, but I started at least documenting the current behavior via #5615. |
If you want to continue thinking of a
To be a bit more constructive: I often summarize data like this:
iris <- as.data.table(iris)
iris[, .N, by = "Species"][, N := NULL][] # to get a data.table with the unique Species values
I know I could also do this:
iris[, unique(.SD), .SDcols = "Species"]
but if I later realize that I actually need some summary, it's harder to get back to
iris[, .(n_obs = .N), by = "Species"]
Ideally, I'd like to be able to do something like
iris[, .(), by = "Species"]
but I need to tell data.table about the
as.data.table(iris)[, data.frame(row.names = 1), by = .(Species)]
does not work. If |
If a data.table is passed to a function of a non-data.table-aware package and is subsetted there such that a 0-column data.table/data.frame is produced, dim() on that data.table/data.frame falsely reports 0 rows.