dim() on 0-column data.table produced in non-data.table-aware package is wrong #2422

Open
akersting opened this issue Oct 16, 2017 · 16 comments · May be fixed by #5918

Comments

@akersting

If a data.table is passed to a function of a non-data.table-aware package and subsetted there such that a 0-column data.table/data.frame is produced, dim() on the result incorrectly reports 0 rows.

library(data.table)
X <- data.table(a = 1:10)

# imitate subsetting in function of non-data.table-aware package
Y <- `[.data.frame`(X, , character(), drop = FALSE)

dim.data.frame(Y)  # returns c(10, 0)
dim(Y)  # returns c(0, 0), also in function of non-data.table-aware package
> sessionInfo()
R version 3.4.2 (2017-09-28)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 8 (jessie)

Matrix products: default
BLAS: /usr/lib/atlas-base/atlas/libblas.so.3.0
LAPACK: /usr/lib/atlas-base/atlas/liblapack.so.3.0

locale:
 [1] LC_CTYPE=de_DE.UTF-8       LC_NUMERIC=C               LC_TIME=de_DE.UTF-8        LC_COLLATE=de_DE.UTF-8    
 [5] LC_MONETARY=de_DE.UTF-8    LC_MESSAGES=de_DE.UTF-8    LC_PAPER=de_DE.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.10.4-2

loaded via a namespace (and not attached):
[1] tools_3.4.2 yaml_2.1.14
@MichaelChirico
Member

MichaelChirico commented Oct 17, 2017

Actually, Y does have dimension (0, 0) (i.e., the error is not dim's fault):

dput(Y)
# structure(list(), .Names = character(0), 
#           class = c("data.table", "data.frame"), row.names = c(NA, -10L))

In fact, Y is not internally consistent, since row.names has retained the 10-row structure (which is what dim.data.frame uses to get its (10,0)), but the table itself is empty.

print(Y)
# Null data.table (0 rows and 0 cols)

row.names(Y)
# [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10"

I don't know what the result of X[ , character(0), with = FALSE] should be, TBH
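To make the source of the two numbers concrete, here is a minimal base R sketch (no data.table involved). It assumes that dim.data.frame takes the row count from the row.names attribute, which `.row_names_info()` exposes, and the column count from the list length. The names `X_df` and `Y_df` are illustrative only.

```r
# Plain data.frame, so only base R methods are involved.
X_df <- data.frame(a = 1:10)
Y_df <- X_df[, character(0), drop = FALSE]

n_cols <- length(Y_df)               # number of list elements (columns)
n_rows <- .row_names_info(Y_df, 2L)  # row count recorded in row.names

c(n_rows, n_cols)  # the two ingredients, computed by hand
dim(Y_df)          # what dim() reports for the plain data.frame
```

On a plain data.frame both lines agree, because the row count lives in row.names rather than in the (absent) columns.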

@mb706

mb706 commented Apr 21, 2018

The bug is not whether the "actual" dimension is (0, 0), but whether it should be. data.table represents zero-row, nonzero-column tables just fine (as a nonempty list of zero-length vectors), but it does not go on to represent nonzero-row, zero-column tables, which would have to be empty lists of nonempty vectors (think about it). In that sense, Y's internal state is entirely consistent; data.table arguably just chose to interpret it as dimension (0, 0).

The behaviour of data.frame in these cases, which is

> iris[character(0)]
data frame with 0 columns and 150 rows

has very desirable properties, for example one has

all.equal(cbind(df[x], df[y]), df[c(x, y)])

Even though X by 0 and 0 by X data.frames or matrices contain no data, they make edge case behaviour more consistent and are useful for package development (even though they may not help much in interactive sessions).
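The identity above can be tried directly in base R with an empty column selection. This is only a sketch of the edge case: `df`, `x`, and `y` are placeholder names, and iris stands in for any data.frame.

```r
df <- iris
x <- character(0)   # selects no columns
y <- "Species"      # selects one column

left  <- cbind(df[x], df[y])  # combine the two pieces
right <- df[c(x, y)]          # select everything in one step

dim(df[x])              # the zero-column piece still has 150 rows
all.equal(left, right)  # checks that both routes give the same data.frame
```

Because `df[x]` keeps its 150 rows, cbind() has nothing to reconcile, and the empty selection needs no special-casing in package code.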

@jangorecki
Member

jangorecki commented Apr 21, 2018

I understand your point, but I still prefer c(0, 0) as the correct answer. Rows are children of columns, if there are no columns there should be no rows returned. As Michael pointed out, it looks more like a bug in R. This is a little problematic because there is not much control over how a non-data.table-aware package will process user data. Eventually, a good solution would be to handle this edge case by detecting whether a call like df[character()] (resulting in a 0-column data.table) was made from a non-data.table-aware package, and then making an exception.

@mb706

mb706 commented Apr 21, 2018

Rows are children of columns, if there are no columns there should be no rows returned

That is just a detail about implementation, though. data.table does handle zero row but nonzero column tables just fine (dt[integer(0), ]). An R matrix can also have zero rows xor zero columns (although in the underlying representation, as a simple vector with additional info of dimensionality, the data is a 0-length vector).

Having zero column tables with nonzero rows is similar to having zero-length vectors that still have a type.
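A short base R sketch of the two analogies, assuming nothing beyond base functions. The variable names are illustrative.

```r
# A matrix can have rows without columns, or columns without rows.
m_rows_only <- matrix(numeric(0), nrow = 3, ncol = 0)
m_cols_only <- matrix(numeric(0), nrow = 0, ncol = 3)

dim(m_rows_only)     # shape is kept even though no data is stored
dim(m_cols_only)
length(m_rows_only)  # the underlying vector is empty

# A zero-length vector still carries a type.
v <- integer(0)
typeof(v)
length(v)
```

In both cases the object holds no values but still carries structural information (a shape or a type) that downstream code can rely on.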

I changed the confusing === notation.

@jangorecki
Member

Having zero column tables with nonzero rows is similar to having zero-length vectors that still have a type.

My understanding is

Having a zero-row table with nonzero columns is similar to having zero-length vectors that still have a type.

Also my comment from linked issue:

In columnar storage, a row is a child of a column. Without columns, no rows exist. This makes sense for multidimensional structures like vector/matrix/array but not for data.frames.

@heavywatal
Contributor

A data.frame is a rectangular list of vectors of the same length. Every data.frame knows its own height, and we cannot add a new column of a different length. This means that we cannot add a nonzero-length vector to an existing zero-row data.frame:

empty_df = data.frame()
empty_df$newcol = seq_len(nrow(iris))    # Error! Height is incompatible
#> Error in `$<-.data.frame`(`*tmp*`, newcol, value = 1:150): replacement has 150 rows, data has 0

df = iris[,FALSE]
print(df)
#> data frame with 0 columns and 150 rows
attributes(df)                           # row.names are preserved
#> $names
#> character(0)
#> 
#> $class
#> [1] "data.frame"
#> 
#> $row.names
#>   [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17
#>  [18]  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34
#>  [35]  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51
#>  [52]  52  53  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68
#>  [69]  69  70  71  72  73  74  75  76  77  78  79  80  81  82  83  84  85
#>  [86]  86  87  88  89  90  91  92  93  94  95  96  97  98  99 100 101 102
#> [103] 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119
#> [120] 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136
#> [137] 137 138 139 140 141 142 143 144 145 146 147 148 149 150
df$newcol = seq_len(nrow(iris))          # OK, df is derived from iris

dt = data.table::data.table(iris)[,FALSE]
print(dt)
#> Null data.table (0 rows and 0 cols)
attributes(dt)                           # row.names are discarded
#> $class
#> [1] "data.table" "data.frame"
#> 
#> $row.names
#> integer(0)
#> 
#> $names
#> character(0)
#> 
#> $.internal.selfref
#> <pointer: 0x7f943b01c2e0>
dt$newcol = seq_len(nrow(iris))          # Error! even if dt is derived from iris
#> Error in `[<-.data.table`(x, j = name, value = value): Cannot use := to add columns to a null data.table (no columns), currently. You can use := to add (empty) columns to a 0-row data.table (1 or more empty columns), though.

Created on 2019-02-06 by the reprex package (v0.2.1)

Keeping dim()[1L] and row.names seems more reasonable to me.

@heavywatal
Contributor

In other words,
Row subsetting should keep column number.
Column subsetting should keep row number.

tbl = tibble::as_tibble(iris)
dt = data.table::data.table(iris)

ncol(tbl[seq_len(3L),])
#> [1] 5
ncol(tbl[integer(0L),])
#> [1] 5
ncol(dt[seq_len(3L),])
#> [1] 5
ncol(dt[integer(0L),])
#> [1] 5

nrow(tbl[,"Species"])
#> [1] 150
nrow(tbl[,FALSE])
#> [1] 150
nrow(dt[,"Species"])
#> [1] 150
nrow(dt[,FALSE])        # Surprise!
#> [1] 0

Created on 2019-02-06 by the reprex package (v0.2.1)

@jangorecki
Member

It would also be useful to show the practical, existing implications of both approaches. For example, if some package breaks, please provide a reproducible example.
rownames are nothing but dimension names, which unfortunately attempt to mimic matrices, where dimension names are perfectly justified. But data.frame is not a multidimensional data structure of any particular dimension (like vectors, matrices, arrays) but a list of independent one-dimensional structures: vectors. The restriction that those vectors have to maintain equal length doesn't change much. Dimension names do not fit into the data.frame concept. Without particular practical implications of that, I am not convinced.

@heavywatal
Contributor

OK, let's forget about row.names (I don't like it either) and focus on dim(x)[1L]. I am currently working on a thin igraph wrapper with Rcpp. Edges and vertices themselves are stored in an igraph_t object, and their attributes such as names and weights are stored in data.frames, say, Eattr and Vattr. Their row numbers should always remain the same as the edge and vertex numbers, respectively. In this scenario, it is quite natural to start from (and sometimes shrink to) zero-column, nonzero-row data.frames. If that were not allowed, I would have to switch between two different methods: one for adding a new column to a non-empty data.frame, and one for adding the first column to a dim c(0, 0) data.frame or null placeholder.
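The workflow described above might look roughly like this with a base R data.frame. This is a sketch under assumptions: `n_vertices` and the column names are hypothetical, and `data.frame(row.names = ...)` is just one way to build a zero-column table of a given height.

```r
n_vertices <- 5L

# Start with no attributes yet, but one row per vertex.
Vattr <- data.frame(row.names = seq_len(n_vertices))
dim_start <- dim(Vattr)

# Adding the first column uses the same code path as adding any later column.
Vattr$name <- letters[seq_len(n_vertices)]

# Dropping all attributes shrinks back to zero columns, keeping the height.
Vattr <- Vattr[, character(0), drop = FALSE]
dim_shrunk <- dim(Vattr)

# A new column can still be added without rebuilding the table.
Vattr$weight <- seq_len(n_vertices) / 10
```

The point is that the table's height acts as an invariant tied to the vertex count, independent of how many attribute columns currently exist.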

@akersting
Author

data.frame is not a multidimensional data structure of any particular dimension

I strongly disagree. A data.frame (and hence also a data.table) is a two-dimensional data structure. That is also why dim returns a vector with two elements.

Anyway, there are two issues with the current behavior of data.table:

  1. It breaks non-data.table-aware packages.
  2. It is inconsistent in itself, since subsetting on one dimension (columns) might change the other one (rows).

@jangorecki
Member

jangorecki commented Feb 24, 2019

@akersting

data.frame is not a multidimensional data structure of any particular dimension

I strongly disagree. A data.frame (and hence also a data.table) is a two-dimensional data structure.

data.frame is a two-dimensional data structure, but not a (particular) case of multidimensional data. Because of that we can store different data types in different columns. This is not possible for multidimensional data, where a column is no different from a row, a page, or any other name you use instead of the integer sequence that maps data into dimensions. Names like rows, columns, and pages don't really have meaning for multidimensional data; they only map integer dimension indexes to some visual representation. They are used only when you want to format data for output. This is also the reason why applying a transpose function to multidimensional data will never alter the data but only rearrange it along some dimension index, which is not true for data.frames, where transpose can alter data.

It is inconsistent in itself, since subsetting on one dimension (columns) might change the other one (rows).

This consistency is exactly what you would expect from multidimensional data, where dimension 1 (let's call it "row") is no different from dimension 2 (let's call it "column"). In data.frames, by contrast, a row is a child of a column.

I am not saying we have to strictly align with the above; we have already made multiple exceptions just for the sake of being consistent with base R.
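The transpose argument can be illustrated with a small base R sketch. The objects `m` and `df` are made up for the example.

```r
# Transposing a matrix only rearranges the values.
m <- matrix(1:6, nrow = 2)
identical(t(t(m)), m)  # round trip gives back the same matrix

# Transposing a data.frame with mixed column types goes through a matrix,
# so every value is coerced to one common type.
df <- data.frame(id = 1:2, label = c("a", "b"))
tdf <- t(df)

typeof(df$id)  # the original column is integer
typeof(tdf)    # the transposed result is a character matrix
```

This shows the asymmetry between rows and columns in a data.frame, which is the basis of the "row is a child of a column" view.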

@jangorecki
Member

related discussion: https://stat.ethz.ch/pipermail/r-devel/2019-May/077796.html

@nbenn
Member

nbenn commented Oct 26, 2019

I just ran into this issue and while I understand (and agree there is some merit to) @jangorecki's "nestedness" argument, I still feel the current data.table behavior is counter-intuitive. Out of the big three data frame structures in R (the other two being tibble and base R data.frame), data.table is the only one to interpret a zero-col/nonzero-row data frame in this way. I feel it would make for a smoother user-experience if zero-col/nonzero-row data.table were to become possible.

@brodieG

brodieG commented May 6, 2020

Related discussion on twitter.

I like the idea of returning an object with the row dimensions, analogous to returning an object with the col dimensions in iris[0,].

@fkohrt
Contributor

fkohrt commented Mar 25, 2023

I think I fail to see instances where the current behaviour can be considered a feature, not a bug. I created a wrapper class around data.tables that pre-allocates rows (until #660 gets resolved) and not having zero-column data.tables with non-zero rows decreases performance: All pre-allocated rows vanish once the last column in the data.table is removed, and pre-allocation has to happen again once new columns are added. It also complicates code as I have to delay the creation of the underlying data.table until users have provided at least one column. I really wish it would be different, but I started at least documenting the current behavior via #5615.

@jan-glx
Contributor

jan-glx commented Jan 19, 2024

If you want to continue thinking of data.tables as lists of vectors instead of something matrix-like (which could have dim c(3, 0)), shouldn't dim return c(NA, 0) for an empty zero-column data.table?

To be a bit more constructive:

I often summarize data like this:

library(data.table)
iris <- as.data.table(iris)
iris[, .N, by = "Species"][, N := NULL][] # to get a data.table with the unique Species values

I know I could also do this:

iris[, unique(.SD), .SDcols = "Species"]

but if I later realize that I actually need some summary, it's harder to get back to

iris[, .(n_obs = .N), by = "Species"]

Ideally, I'd like to be able to do something like

iris[, .(), by = "Species"]

but I need to tell data.table about the nrow() == 1.

as.data.table(iris)[, data.frame(row.names = 1), by=.(Species)]

does not work. If data.table supported non-zero-row, zero-column data.tables, there would be a natural way to express this.

@ben-schwen ben-schwen linked a pull request Jan 19, 2024 that will close this issue