Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix #4783 with a designated option #4947

Closed
wants to merge 9 commits into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
3 changes: 2 additions & 1 deletion DESCRIPTION
Original file line number Diff line number Diff line change
Expand Up @@ -61,7 +61,8 @@ Authors@R: c(
person("Vaclav","Tlapak", role="ctb"),
person("Kevin","Ushey", role="ctb"),
person("Dirk","Eddelbuettel", role="ctb"),
person("Ben","Schwen", role="ctb"))
person("Ben","Schwen", role="ctb"),
person("Ofek", "Shilon", role="ctb"))
Depends: R (>= 3.1.0)
Imports: methods
Suggests: bit64 (>= 4.0.0), bit (>= 4.0.4), curl, R.utils, xts, nanotime, zoo (>= 1.8-1), yaml, knitr, rmarkdown
Expand Down
4 changes: 4 additions & 0 deletions NEWS.md
Original file line number Diff line number Diff line change
Expand Up @@ -42,6 +42,10 @@

2. `fintersect()` now retains the order of the first argument as reasonably expected, rather than retaining the order of the second argument, [#4716](https://github.com/Rdatatable/data.table/issues/4716). Thanks to Michel Lang for reporting, and Ben Schwen for the PR.

3. `setDF()` now clears existing indices, as it does for other `data.table`-only attributes. Doing so prevents some errors that could happen if the `setDT()` is later applied after the indices are mutated; see [#4889](https://github.com/Rdatatable/data.table/issues/4889). Thanks @OfekShilon for the report and the fix.

4. New option added - "datatable.assign.inplace". When set to FALSE, it overrides the default behavior of assigning column data in place, thereby solving [#4783](https://github.com/Rdatatable/data.table/issues/4783) and indirectly also [#3215](https://github.com/Rdatatable/data.table/issues/3215). The extra allocation overhead is typically negligible. Thanks @OfekShilon for the report and the fix.

## NOTES

1. Compiling from source no longer requires `zlib` header files to be available, [#4844](https://github.com/Rdatatable/data.table/pull/4844). The output suggests installing `zlib` headers, and how (e.g. `zlib1g-dev` on Ubuntu) as before, but now proceeds with `gzip` compression disabled in `fwrite`. Upon calling `fwrite(DT, "file.csv.gz")` at runtime, an error message suggests to reinstall `data.table` with `zlib` headers available. This does not apply to users on Windows or Mac who install the pre-compiled binary package from CRAN.
Expand Down
4 changes: 4 additions & 0 deletions R/data.table.R
Original file line number Diff line number Diff line change
Expand Up @@ -2654,6 +2654,8 @@ setDF = function(x, rownames=NULL) {
setattr(x, "class", "data.frame")
setattr(x, "sorted", NULL)
setattr(x, ".internal.selfref", NULL)
setattr(x, "index", NULL) # bug 4883
setattr(x, "allow.assign.inplace", NULL)
} else if (is.data.frame(x)) {
if (!is.null(rownames)) {
if (length(rownames) != nrow(x))
Expand Down Expand Up @@ -2728,6 +2730,7 @@ setDT = function(x, keep.rownames=FALSE, key=NULL, check.names=FALSE) {
x[, (nm[1L]) := rn]
setcolorder(x, nm)
}
setattr(x, "allow.assign.inplace", FALSE)
} else if (is.list(x) && length(x)==1L && is.matrix(x[[1L]])) {
# a single list(matrix) is unambiguous and depended on by some revdeps, #3581
x = as.data.table.matrix(x[[1L]])
Expand Down Expand Up @@ -2763,6 +2766,7 @@ setDT = function(x, keep.rownames=FALSE, key=NULL, check.names=FALSE) {
setattr(x,"row.names",.set_row_names(n_range[2L]))
setattr(x,"class",c("data.table","data.frame"))
setalloccol(x)
setattr(x, "allow.assign.inplace", FALSE)
} else {
stop("Argument 'x' to 'setDT' should be a 'list', 'data.frame' or 'data.table'")
}
Expand Down
42 changes: 33 additions & 9 deletions inst/tests/tests.Rraw
Original file line number Diff line number Diff line change
Expand Up @@ -7484,15 +7484,22 @@ test(1542.04, key(DT1), "id")
DT1[,newCol:=6:10]
test(1542.05, DT, data.table(id=1:5, key="id"))
test(1542.06, DT1, data.table(id=1:5, newCol=6:10, key="id"))

orig.opt <- options(datatable.assign.inplace=TRUE)
DT1[3, id:=6L]
test(1542.07, DT1, data.table(id=INT(1,2,6,4,5), newCol=6:10))

# current wrong behaviour (invalid key, #3215); root cause is the shallow exposed uniquely via DT[TRUE]
# wrong behaviour (invalid key, #3215); root cause is the shallow exposed uniquely via DT[TRUE]
test(1542.08, DT$id, INT(1,2,6,4,5))
test(1542.09, key(DT), "id")
# future correct behaviour :
# test(1542.10, DT$id, 1:5)
# test(1542.11, key(DT), "id")
# correct behaviour after disabling inplace:
options(datatable.assign.inplace=FALSE)
DT = data.table(id = 1:5, key="id")
DT1 = DT[TRUE]
DT1[,newCol:=6:10]
DT1[3, id:=6L]
test(1542.10, DT$id, 1:5)
test(1542.11, key(DT), "id")
options(orig.opt)

# rest of #1130 - merge doesn't copy, instead uses joins without keys.
set.seed(1L)
Expand Down Expand Up @@ -17264,12 +17271,29 @@ if (identical(x, enc2native(x))) {
# fintersect now preserves order of first argument like intersect, #4716
test(2163, fintersect(data.table(x=c("b", "c", "a")), data.table(x=c("a","c")))$x, c("c", "a"))


# setDF now drops index attributes, #4889
d <- data.table(a=1:100, b=1:100)
setindex(d, a)
setDF(d)
d[1:50, "a"] <- d[51:100, "a"]
setDT(d)
test(2164.1, nrow(d[a==99]), as.integer(2))

# Fix: In-place assignment can leak to origin DF, #4783
d <- data.frame(a=c(1,2), b=c(1,2))
df <- d
setDT(d)
d[!is.na(a), b:=c(20, 21)] # stop the leak to df
test(2164, df$b[1], 1)

options(orig.opt)

# mean na.rm=TRUE GForce, #4849
d = data.table(a=1, b=list(1,2))
test(2164.1, d[, mean(b), by=a], error="not supported by GForce mean")
test(2165.1, d[, mean(b), by=a], error="not supported by GForce mean")
if (test_bit64) {
d = data.table(a=INT(1,1,2,2), b=as.integer64(c(2,3,4,NA)))
test(2164.2, d[, mean(b), by=a], data.table(a=INT(1,2), V1=c(2.5, NA)))
test(2164.3, d[, mean(b, na.rm=TRUE), by=a], data.table(a=INT(1,2), V1=c(2.5, 4)))
test(2165.2, d[, mean(b), by=a], data.table(a=INT(1,2), V1=c(2.5, NA)))
test(2165.3, d[, mean(b, na.rm=TRUE), by=a], data.table(a=INT(1,2), V1=c(2.5, 4)))
}

2 changes: 1 addition & 1 deletion man/setDF.Rd
Original file line number Diff line number Diff line change
Expand Up @@ -15,7 +15,7 @@ setDF(x, rownames=NULL)
}

\details{
All \code{data.table} attributes including any keys of the input data.table are stripped off.
All \code{data.table} attributes including any keys and indices of the input data.table are stripped off.

When using \code{rownames}, recall that the row names of a \code{data.frame} must be unique. By default, the assigned set of row names is simply the sequence 1, \ldots, \code{nrow(x)} (or \code{length(x)} for \code{list}s).
}
Expand Down
12 changes: 11 additions & 1 deletion src/assign.c
Original file line number Diff line number Diff line change
Expand Up @@ -474,6 +474,10 @@ SEXP assign(SEXP dt, SEXP rows, SEXP cols, SEXP newcolnames, SEXP values)
SET_STRING_ELT(names,oldncol+i,STRING_ELT(newcolnames,i));
// truelengths of both already set by alloccol
}

SEXP attrib = getAttrib(dt, sym_assign_inplace);
Rboolean allow_assign_in_place = isLogical(attrib) && LENGTH(attrib)==1 && LOGICAL(attrib)[0]==1;

for (i=0; i<length(cols); i++) {
coln = INTEGER(cols)[i]-1;
SEXP thisvalue = RHS_list_of_columns ? VECTOR_ELT(values, i) : values;
Expand Down Expand Up @@ -511,9 +515,15 @@ SEXP assign(SEXP dt, SEXP rows, SEXP cols, SEXP newcolnames, SEXP values)
if (isVectorAtomic(thisvalue)) copyMostAttrib(thisvalue,targetcol); // class etc but not names
// else for lists (such as data.frame and data.table) treat them as raw lists and drop attribs
if (vlen<1) continue; // e.g. DT[,newcol:=integer()] (adding new empty column)
} else { // existing column
}
else if (allow_assign_in_place) {
// overwrite rows in existing column - risking data leakage, #4784
targetcol = VECTOR_ELT(dt,coln);
}
else { // Fix for #4783: "copy on modify" for the entire column
SET_VECTOR_ELT(dt, coln, targetcol=duplicate(VECTOR_ELT(dt, coln)));
}

const char *ret = memrecycle(targetcol, rows, 0, targetlen, thisvalue, 0, -1, coln+1, CHAR(STRING_ELT(names, coln)));
if (ret) warning(ret);
}
Expand Down
1 change: 1 addition & 0 deletions src/data.table.h
Original file line number Diff line number Diff line change
Expand Up @@ -101,6 +101,7 @@ extern SEXP sym_inherits;
extern SEXP sym_datatable_locked;
extern SEXP sym_tzone;
extern SEXP sym_old_fread_datetime_character;
extern SEXP sym_assign_inplace;
extern double NA_INT64_D;
extern long long NA_INT64_LL;
extern Rcomplex NA_CPLX; // initialized in init.c; see there for comments
Expand Down
2 changes: 2 additions & 0 deletions src/init.c
Original file line number Diff line number Diff line change
Expand Up @@ -34,6 +34,7 @@ SEXP sym_inherits;
SEXP sym_datatable_locked;
SEXP sym_tzone;
SEXP sym_old_fread_datetime_character;
SEXP sym_assign_inplace;
double NA_INT64_D;
long long NA_INT64_LL;
Rcomplex NA_CPLX;
Expand Down Expand Up @@ -357,6 +358,7 @@ void attribute_visible R_init_datatable(DllInfo *info)
sym_datatable_locked = install(".data.table.locked");
sym_tzone = install("tzone");
sym_old_fread_datetime_character = install("datatable.old.fread.datetime.character");
sym_assign_inplace = install("allow.assign.inplace");

initDTthreads();
avoid_openmp_hang_within_fork();
Expand Down
6 changes: 3 additions & 3 deletions src/utils.c
Original file line number Diff line number Diff line change
Expand Up @@ -381,11 +381,11 @@ SEXP coerceAs(SEXP x, SEXP as, SEXP copyArg) {
#include <zlib.h>
#endif
SEXP dt_zlib_version() {
char out[51];
char out[71];
#ifndef NOZLIB
snprintf(out, 50, "zlibVersion()==%s ZLIB_VERSION==%s", zlibVersion(), ZLIB_VERSION);
snprintf(out, 70, "zlibVersion()==%s ZLIB_VERSION==%s", zlibVersion(), ZLIB_VERSION);
#else
snprintf(out, 50, "zlib header files were not found when data.table was compiled");
snprintf(out, 70, "zlib header files were not found when data.table was compiled");
#endif
return ScalarString(mkChar(out));
}