dim() on 0-column data.table produced in non-data.table-aware package is wrong #2422

akersting · 2017-10-16T18:22:54Z

If a data.table is passed to a function of a non-data.table-aware package and subset there such that a 0-column data.table/data.frame is produced, dim() on the result falsely reports 0 rows.

library(data.table)
X <- data.table(a = 1:10)

# imitate subsetting in function of non-data.table-aware package
Y <- `[.data.frame`(X, , character(), drop = FALSE)

dim.data.frame(Y)  # returns c(10, 0)
dim(Y)  # returns c(0, 0), also in function of non-data.table-aware package

> sessionInfo()
R version 3.4.2 (2017-09-28)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 8 (jessie)

Matrix products: default
BLAS: /usr/lib/atlas-base/atlas/libblas.so.3.0
LAPACK: /usr/lib/atlas-base/atlas/liblapack.so.3.0

locale:
 [1] LC_CTYPE=de_DE.UTF-8       LC_NUMERIC=C               LC_TIME=de_DE.UTF-8        LC_COLLATE=de_DE.UTF-8    
 [5] LC_MONETARY=de_DE.UTF-8    LC_MESSAGES=de_DE.UTF-8    LC_PAPER=de_DE.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.10.4-2

loaded via a namespace (and not attached):
[1] tools_3.4.2 yaml_2.1.14

MichaelChirico · 2017-10-17T01:49:44Z

Actually, Y does have dimension (0, 0) (i.e., the error is not dim's fault):

dput(Y)
# structure(list(), .Names = character(0), 
#           class = c("data.table", "data.frame"), row.names = c(NA, -10L))

In fact, Y is not internally consistent, since row.names has retained the 10-row structure (which is what dim.data.frame uses to get its (10,0)), but the table itself is empty.

print(Y)
# Null data.table (0 rows and 0 cols)

row.names(Y)
# [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10"
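
To make the source of the (10, 0) concrete, here is a minimal base R sketch (no data.table involved; `df` and `empty_cols` are illustrative names, not objects from the report). For a plain data.frame, the row count comes from the row.names attribute and the column count from the length of the underlying list:

```r
# Base R only; `df` and `empty_cols` are illustrative names.
df <- data.frame(a = 1:10)
empty_cols <- df[, character(0), drop = FALSE]

length(empty_cols)               # column count: length of the underlying list
.row_names_info(empty_cols, 2L)  # row count as recorded in the row.names attribute
dim(empty_cols)                  # dim() on a data.frame combines the two values above
```

With no columns left, the row.names attribute is the only place where the row count can still be stored.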

I don't know what the result of X[ , character(0), with = FALSE] should be, TBH

mb706 · 2018-04-21T12:53:41Z

The issue is not whether the "actual" dimension is (0, 0), but whether it should be. data.table represents zero-row-nonzero-column tables just fine (as a nonempty list of zero-length vectors), but does not go on to represent nonzero-row-zero-column tables as empty lists of nonempty vectors (think about it). In that sense, Y's internal state is entirely consistent; data.table arguably just chose to interpret it as dimension (0, 0).

The behaviour of data.frame in these cases, which is

> iris[character(0)]
data frame with 0 columns and 150 rows

has very desirable properties, for example one has

all.equal(cbind(df[x], df[y]), df[c(x, y)])

Even though X by 0 and 0 by X data.frames or matrices contain no data, they make edge case behaviour more consistent and are useful for package development (even though they may not help much in interactive sessions).
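
As a sketch of that edge case in base R (the column selections `x` and `y` are arbitrary choices for illustration), the identity can be checked with an empty selection on one side:

```r
# Base R only; `x` deliberately selects no columns.
df <- iris
x <- character(0)
y <- c("Sepal.Length", "Species")

lhs <- cbind(df[x], df[y])  # a 0-column piece bound to a 2-column piece
rhs <- df[c(x, y)]          # the same columns selected in one step

all.equal(lhs, rhs)
dim(df[x])                  # the 0-column piece still reports the row count of df
```

The property holds here only because df[x] keeps its 150 rows; a (0, 0) result would not be a neutral element for cbind.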

jangorecki · 2018-04-21T13:25:27Z

I understand your point, but I still prefer c(0, 0) as the correct answer. Rows are children of columns; if there are no columns there should be no rows returned. As Michael pointed out, it looks more like a bug in R. This is a little problematic because there is not much control over how a non-data.table-aware package will process user data. Eventually, a good solution would be to handle this edge case by detecting whether a call like df[character()] (resulting in a 0-column data.table) was made from a non-data.table-aware package and then making an exception.

mb706 · 2018-04-21T14:55:01Z

Rows are children of columns; if there are no columns there should be no rows returned

That is just a detail about implementation, though. data.table does handle zero row but nonzero column tables just fine (dt[integer(0), ]). An R matrix can also have zero rows xor zero columns (although in the underlying representation, as a simple vector with additional info of dimensionality, the data is a 0-length vector).

Having zero column tables with nonzero rows is similar to having zero-length vectors that still have a type.
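
A small base R illustration of the matrix case mentioned above (the name `m` is just for the example): a matrix with three rows and no columns holds no data, yet keeps both its dimensions and its type.

```r
# A 3 x 0 integer matrix: no elements, but dimensions and type are retained.
m <- matrix(integer(0), nrow = 3, ncol = 0)

dim(m)       # both dimensions are stored, even though one of them is zero
length(m)    # the underlying vector has no elements
typeof(m)    # the element type is still known
dim(t(m))    # transposing swaps the dimensions without losing either
```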

I changed the confusing === notation.

jangorecki · 2019-02-06T04:37:05Z

Having zero column tables with nonzero rows is similar to having zero-length vectors that still have a type.

My understanding is

Having zero rows table with nonzero columns is similar to having zero-length vectors that still have a type.

Also my comment from linked issue:

In columnar storage, a row is a child of a column. Without a column, no rows exist. This makes sense for multidimensional structures like vector/matrix/array but not data.frames.

heavywatal · 2019-02-06T05:29:12Z

A data.frame is a rectangular list of vectors of the same length. Every data.frame knows its own height, and we cannot add a new column with a different length. This means that we cannot add a nonzero-length vector to an existing zero-row data.frame:

empty_df = data.frame()
empty_df$newcol = seq_len(nrow(iris))    # Error! Height is incompatible
#> Error in `$<-.data.frame`(`*tmp*`, newcol, value = 1:150): replacement has 150 rows, data has 0

df = iris[,FALSE]
print(df)
#> data frame with 0 columns and 150 rows
attributes(df)                           # row.names are preserved
#> $names
#> character(0)
#> 
#> $class
#> [1] "data.frame"
#> 
#> $row.names
#>   [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17
#>  [18]  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34
#>  [35]  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51
#>  [52]  52  53  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68
#>  [69]  69  70  71  72  73  74  75  76  77  78  79  80  81  82  83  84  85
#>  [86]  86  87  88  89  90  91  92  93  94  95  96  97  98  99 100 101 102
#> [103] 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119
#> [120] 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136
#> [137] 137 138 139 140 141 142 143 144 145 146 147 148 149 150
df$newcol = seq_len(nrow(iris))          # OK, df is derived from iris

dt = data.table::data.table(iris)[,FALSE]
print(dt)
#> Null data.table (0 rows and 0 cols)
attributes(dt)                           # row.names are discarded
#> $class
#> [1] "data.table" "data.frame"
#> 
#> $row.names
#> integer(0)
#> 
#> $names
#> character(0)
#> 
#> $.internal.selfref
#> <pointer: 0x7f943b01c2e0>
dt$newcol = seq_len(nrow(iris))          # Error! even if dt is derived from iris
#> Error in `[<-.data.table`(x, j = name, value = value): Cannot use := to add columns to a null data.table (no columns), currently. You can use := to add (empty) columns to a 0-row data.table (1 or more empty columns), though.

Created on 2019-02-06 by the reprex package (v0.2.1)

Keeping dim()[1L] and row.names seems more reasonable to me.

heavywatal · 2019-02-06T06:10:03Z

In other words,
Row subsetting should keep column number.
Column subsetting should keep row number.

tbl = tibble::as_tibble(iris)
dt = data.table::data.table(iris)

ncol(tbl[seq_len(3L),])
#> [1] 5
ncol(tbl[integer(0L),])
#> [1] 5
ncol(dt[seq_len(3L),])
#> [1] 5
ncol(dt[integer(0L),])
#> [1] 5

nrow(tbl[,"Species"])
#> [1] 150
nrow(tbl[,FALSE])
#> [1] 150
nrow(dt[,"Species"])
#> [1] 150
nrow(dt[,FALSE])        # Surprise!
#> [1] 0

Created on 2019-02-06 by the reprex package (v0.2.1)

jangorecki · 2019-02-06T06:30:13Z

What would also be useful is to show existing practical implications of both approaches. For example, if some package breaks, then provide a reproducible example.
rownames are nothing but dimension names, which unfortunately attempt to mimic a matrix, where dimension names are perfectly justified. But a data.frame is not a multidimensional data structure of any particular dimension (like a vector, matrix, or array) but a list of independent one-dimensional structures: vectors. The restriction that those vectors must have equal length doesn't change much. Dimension names do not fit into the data.frame concept. Without particular practical implications of that, I am not convinced.

heavywatal · 2019-02-06T07:42:27Z

OK, let's forget about row.names (I don't like it either) and focus on dim(x)[1L]. I am currently working on a thin igraph wrapper with Rcpp. Edges and vertices themselves are stored in an igraph_t object, and their attributes such as names and weights are stored in data.frames, say, Eattr and Vattr. Their row numbers should always remain the same as the edge and vertex numbers, respectively. In this scenario, it is quite natural to start from (and sometimes shrink to) zero-column nonzero-row data.frames. If that were not allowed, I would have to switch between two different methods: one for adding a new column to a non-empty data.frame, and one for adding a first column to a dim c(0, 0) data.frame or null placeholder.
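
A minimal sketch of that workflow with a base R data.frame (Vattr is the name used above; `n_vertices` and the column values are hypothetical and stand in for whatever the wrapper would supply):

```r
# Base R only; `n_vertices` is a hypothetical vertex count.
n_vertices <- 5L

# Start from a zero-column table that already knows its height.
Vattr <- data.frame(row.names = seq_len(n_vertices))
dim_start <- dim(Vattr)

# Adding the first column uses the same code path as adding any later column.
Vattr$name <- letters[seq_len(n_vertices)]
Vattr$weight <- seq_len(n_vertices) / 10
dim_filled <- dim(Vattr)

# Shrinking back to zero columns keeps the row count.
Vattr$name <- NULL
Vattr$weight <- NULL
dim_end <- dim(Vattr)
```

Because the height survives the removal of the last column, no special case is needed for the "first column" situation.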

akersting · 2019-02-23T15:20:36Z

data.frame is not a multidimensional data structure of any particular dimension

I strongly disagree. A data.frame (and hence also a data.table) is a two-dimensional data structure. That is also why dim returns a vector with two elements.

Anyway, there are two issues with the current behavior of data.table:

It breaks non-data.table-aware packages.
It is inconsistent in itself, since subsetting on one dimension (columns) might change the other one (rows).

jangorecki · 2019-02-24T05:15:20Z

@akersting

data.frame is not a multidimensional data structure of any particular dimension

I strongly disagree. A data.frame (and hence also a data.table) is a two-dimensional data structure.

A data.frame is a two-dimensional data structure, but not a (particular) case of multidimensional data. That is why we can store different data types in different columns. This is not possible for multidimensional data, where a column is no different from a row, a page, or any other name you use in place of the integer sequence that maps data onto dimensions. Names like rows, columns, and pages don't really have meaning for multidimensional data; they only map integer dimension indexes onto some visual representation and are used only when you format data for output. This is also why transposing multidimensional data never alters the data but only rearranges it along the dimension indexes, which is not true for data.frames, where transposing can alter the data.
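
The transpose point can be sketched in base R (the objects `m` and `df` are made up for the illustration): a matrix survives a round trip through t() unchanged, while transposing a data.frame with mixed column types coerces everything to one type.

```r
# Matrix: transposing twice gives back exactly the same object.
m <- matrix(1:6, nrow = 2)
round_trip <- t(t(m))
identical(round_trip, m)

# data.frame with mixed column types: t() goes through a matrix,
# so the integer column is coerced to character along the way.
df <- data.frame(id = 1:2, label = c("a", "b"))
tdf <- t(df)
typeof(tdf)
```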

It is inconsistent in itself, since subsetting on one dimension (columns) might change the other one (rows).

This consistency is exactly what you would expect from multidimensional data, where dimension 1 (let's call it "row") is no different from dimension 2 (let's call it "column"). In data.frames, by contrast, a row is a child of a column.

I am not saying we have to strictly align with the above; we have already made multiple exceptions just for the sake of being consistent with base R.

jangorecki · 2019-05-17T13:32:26Z

related discussion: https://stat.ethz.ch/pipermail/r-devel/2019-May/077796.html

nbenn · 2019-10-26T15:27:42Z

I just ran into this issue and while I understand (and agree there is some merit to) @jangorecki's "nestedness" argument, I still feel the current data.table behavior is counter-intuitive. Out of the big three data frame structures in R (the other two being tibble and base R data.frame), data.table is the only one to interpret a zero-col/nonzero-row data frame in this way. I feel it would make for a smoother user experience if zero-col/nonzero-row data.tables were to become possible.

brodieG · 2020-05-06T14:00:09Z

Related discussion on twitter.

I like the idea of returning an object with the row dimensions, analogous to returning an object with the col dimensions in iris[0,].

fkohrt · 2023-03-25T10:38:23Z

I think I fail to see instances where the current behaviour can be considered a feature, not a bug. I created a wrapper class around data.tables that pre-allocates rows (until #660 gets resolved) and not having zero-column data.tables with non-zero rows decreases performance: All pre-allocated rows vanish once the last column in the data.table is removed, and pre-allocation has to happen again once new columns are added. It also complicates code as I have to delay the creation of the underlying data.table until users have provided at least one column. I really wish it would be different, but I started at least documenting the current behavior via #5615.

jan-glx · 2024-01-19T16:30:10Z

If you want to continue thinking of a data.table as a list of vectors instead of something matrix-like (which could have dim (3, 0)), shouldn't dim return c(NA, 0) for an empty zero-column data.table?

To be a bit more constructive:

I often summarize data like this:

iris <- as.data.table(iris)
iris[, .N, by = "Species"][, N := NULL][]  # to get a data.table with the unique Species values

I know, I could also do this:

iris[, unique(.SD), .SDcols = "Species"]

but if I later realize that I actually need some summary, it's harder to get back to

iris[, .(n_obs = .N), by = "Species"]

Ideally I'd like to be able to do something like

iris[, .(), by = "Species"]

but I need to tell data.table about the nrow()==1

as.data.table(iris)[, data.frame(row.names = 1), by=.(Species)]

does not work. If data.table supported non-zero-row-zero-column data.tables, there would be a natural way to express this.

mb706 mentioned this issue Apr 21, 2018

cpoSelect can't handle data tables mlr-org/mlrCPO#41

Open

MichaelChirico mentioned this issue Nov 19, 2018

Should 0-column data.tables preserve the number of rows? #3155

Closed

jangorecki mentioned this issue Aug 6, 2019

WISH: rbind on data.frames with 0 columns would preserve rows HenrikBengtsson/Wishlist-for-R#77

Open

mgirlich mentioned this issue Dec 10, 2020

Why does dt[character(0L)] not error during tidyr call #4825

Closed

ben-schwen linked a pull request Jan 19, 2024 that will close this issue

add escape for datatable unaware for package #5918

Draft
