Skip to content

Commit

Permalink
Out-of-sample quote rule bump with warning, #2265
Browse files Browse the repository at this point in the history
  • Loading branch information
mattdowle committed Feb 24, 2018
1 parent b392bfe commit c5b5613
Show file tree
Hide file tree
Showing 3 changed files with 40 additions and 19 deletions.
2 changes: 1 addition & 1 deletion NEWS.md
Expand Up @@ -30,7 +30,7 @@
* `skip=` and `nrow=` are more reliable and no longer affected by invalid lines outside the range specified. Thanks to Ziyad Saeed and Kyle Chung for reporting, [#1267](https://github.com/Rdatatable/data.table/issues/1267). Tests added.
* Ram disk (`/dev/shm`) is no longer used for the output of system command input. Although faster when it worked, it was causing too many device full errors; e.g., [#1139](https://github.com/Rdatatable/data.table/issues/1139) and [zUMIs/19](https://github.com/sdparekh/zUMIs/issues/19). Thanks to Kyle Chung for reporting. Standard `tempdir()` is now used. If you wish to use ram disk, set TEMPDIR to `/dev/shm`; see `?tempdir`.
* Detecting whether a very long input string is a file name or data is now much faster, [#2531](https://github.com/Rdatatable/data.table/issues/2531). Many thanks to @javrucebo for the detailed report, benchmarks and suggestions.
* Many thanks to @yaakovfeldman, Guillermo Ponce, Arun Srinivasan, Hugh Parsonage, Mark Klik, Pasha Stetsenko, Mahyar K, Tom Crockett, @cnoelke, @qinjs, @etienne-s, Mark Danese, Avraham Adler, @franknarf1, @MichaelChirico, @tdhock, Luke Tierney for testing dev and reporting these regressions before release to CRAN: [#2070](https://github.com/Rdatatable/data.table/issues/2070), [#2073](https://github.com/Rdatatable/data.table/issues/2073), [#2087](https://github.com/Rdatatable/data.table/issues/2087), [#2091](https://github.com/Rdatatable/data.table/issues/2091), [#2107](https://github.com/Rdatatable/data.table/issues/2107), [fst#50](https://github.com/fstpackage/fst/issues/50#issuecomment-294287846), [#2118](https://github.com/Rdatatable/data.table/issues/2118), [#2092](https://github.com/Rdatatable/data.table/issues/2092), [#1888](https://github.com/Rdatatable/data.table/issues/1888), [#2123](https://github.com/Rdatatable/data.table/issues/2123), [#2167](https://github.com/Rdatatable/data.table/issues/2167), [#2194](https://github.com/Rdatatable/data.table/issues/2194), [#2238](https://github.com/Rdatatable/data.table/issues/2238), [#2228](https://github.com/Rdatatable/data.table/issues/2228), [#1464](https://github.com/Rdatatable/data.table/issues/1464), [#2201](https://github.com/Rdatatable/data.table/issues/2201), [#2287](https://github.com/Rdatatable/data.table/issues/2287), [#2299](https://github.com/Rdatatable/data.table/issues/2299), [#2285](https://github.com/Rdatatable/data.table/issues/2285), [#2251](https://github.com/Rdatatable/data.table/issues/2251), [#2347](https://github.com/Rdatatable/data.table/issues/2347), [#2222](https://github.com/Rdatatable/data.table/issues/2222), [#2352](https://github.com/Rdatatable/data.table/issues/2352), [#2246](https://github.com/Rdatatable/data.table/issues/2246), [#2370](https://github.com/Rdatatable/data.table/issues/2370), [#2371](https://github.com/Rdatatable/data.table/issues/2371), [#2404](https://github.com/Rdatatable/data.table/issues/2404), [#2196](https://github.com/Rdatatable/data.table/issues/2196), [#2322](https://github.com/Rdatatable/data.table/issues/2322), [#2453](https://github.com/Rdatatable/data.table/issues/2453), [#2446](https://github.com/Rdatatable/data.table/issues/2446), [#2464](https://github.com/Rdatatable/data.table/issues/2464), [#2457](https://github.com/Rdatatable/data.table/issues/2457), [#1895](https://github.com/Rdatatable/data.table/issues/1895), [#2481](https://github.com/Rdatatable/data.table/pull/2481), [#2499](https://github.com/Rdatatable/data.table/issues/2499), [#2516](https://github.com/Rdatatable/data.table/issues/2516), [#2520](https://github.com/Rdatatable/data.table/issues/2520), [#2512](https://github.com/Rdatatable/data.table/issues/2512), [#2523](https://github.com/Rdatatable/data.table/issues/2523), [#2542](https://github.com/Rdatatable/data.table/issues/2542), [#2526](https://github.com/Rdatatable/data.table/issues/2526), [#2518](https://github.com/Rdatatable/data.table/issues/2518), [#2515](https://github.com/Rdatatable/data.table/issues/2515), [#1671](https://github.com/Rdatatable/data.table/issues/1671), [#2267](https://github.com/Rdatatable/data.table/issues/2267), [#2561](https://github.com/Rdatatable/data.table/issues/2561), [#2625](https://github.com/Rdatatable/data.table/issues/2625)
* Many thanks to @yaakovfeldman, Guillermo Ponce, Arun Srinivasan, Hugh Parsonage, Mark Klik, Pasha Stetsenko, Mahyar K, Tom Crockett, @cnoelke, @qinjs, @etienne-s, Mark Danese, Avraham Adler, @franknarf1, @MichaelChirico, @tdhock, Luke Tierney for testing dev and reporting these regressions before release to CRAN: [#2070](https://github.com/Rdatatable/data.table/issues/2070), [#2073](https://github.com/Rdatatable/data.table/issues/2073), [#2087](https://github.com/Rdatatable/data.table/issues/2087), [#2091](https://github.com/Rdatatable/data.table/issues/2091), [#2107](https://github.com/Rdatatable/data.table/issues/2107), [fst#50](https://github.com/fstpackage/fst/issues/50#issuecomment-294287846), [#2118](https://github.com/Rdatatable/data.table/issues/2118), [#2092](https://github.com/Rdatatable/data.table/issues/2092), [#1888](https://github.com/Rdatatable/data.table/issues/1888), [#2123](https://github.com/Rdatatable/data.table/issues/2123), [#2167](https://github.com/Rdatatable/data.table/issues/2167), [#2194](https://github.com/Rdatatable/data.table/issues/2194), [#2238](https://github.com/Rdatatable/data.table/issues/2238), [#2228](https://github.com/Rdatatable/data.table/issues/2228), [#1464](https://github.com/Rdatatable/data.table/issues/1464), [#2201](https://github.com/Rdatatable/data.table/issues/2201), [#2287](https://github.com/Rdatatable/data.table/issues/2287), [#2299](https://github.com/Rdatatable/data.table/issues/2299), [#2285](https://github.com/Rdatatable/data.table/issues/2285), [#2251](https://github.com/Rdatatable/data.table/issues/2251), [#2347](https://github.com/Rdatatable/data.table/issues/2347), [#2222](https://github.com/Rdatatable/data.table/issues/2222), [#2352](https://github.com/Rdatatable/data.table/issues/2352), [#2246](https://github.com/Rdatatable/data.table/issues/2246), [#2370](https://github.com/Rdatatable/data.table/issues/2370), [#2371](https://github.com/Rdatatable/data.table/issues/2371), [#2404](https://github.com/Rdatatable/data.table/issues/2404), [#2196](https://github.com/Rdatatable/data.table/issues/2196), [#2322](https://github.com/Rdatatable/data.table/issues/2322), [#2453](https://github.com/Rdatatable/data.table/issues/2453), [#2446](https://github.com/Rdatatable/data.table/issues/2446), [#2464](https://github.com/Rdatatable/data.table/issues/2464), [#2457](https://github.com/Rdatatable/data.table/issues/2457), [#1895](https://github.com/Rdatatable/data.table/issues/1895), [#2481](https://github.com/Rdatatable/data.table/pull/2481), [#2499](https://github.com/Rdatatable/data.table/issues/2499), [#2516](https://github.com/Rdatatable/data.table/issues/2516), [#2520](https://github.com/Rdatatable/data.table/issues/2520), [#2512](https://github.com/Rdatatable/data.table/issues/2512), [#2523](https://github.com/Rdatatable/data.table/issues/2523), [#2542](https://github.com/Rdatatable/data.table/issues/2542), [#2526](https://github.com/Rdatatable/data.table/issues/2526), [#2518](https://github.com/Rdatatable/data.table/issues/2518), [#2515](https://github.com/Rdatatable/data.table/issues/2515), [#1671](https://github.com/Rdatatable/data.table/issues/1671), [#2267](https://github.com/Rdatatable/data.table/issues/2267), [#2561](https://github.com/Rdatatable/data.table/issues/2561), [#2625](https://github.com/Rdatatable/data.table/issues/2625), [#2265](https://github.com/Rdatatable/data.table/issues/2265)

2. `fwrite()`:
* empty strings are now always quoted (`,"",`) to distinguish them from `NA` which by default is still empty (`,,`) but can be changed using `na=` as before. If `na=` is provided and `quote=` is the default `'auto'` then `quote=` is set to `TRUE` so that if the `na=` value occurs in the data, it can be distinguished from `NA`. Thanks to Ethan Welty for the request [#2214](https://github.com/Rdatatable/data.table/issues/2214) and Pasha for the code change and tests, [#2215](https://github.com/Rdatatable/data.table/issues/2215).
Expand Down
14 changes: 11 additions & 3 deletions inst/tests/tests.Rraw
Expand Up @@ -11718,14 +11718,22 @@ unlink(f)
# 1880.1 should fail in there are any duplicate names after a join
# 1880.2 should fail if a warning is not thrown when suffixes leads to duplicate names
parents = data.table(name=c("Sarah", "Max"), sex=c("F", "M"), age=c(41, 43))
children = data.table(parent=c("Sarah", "Max", "Max"),
children = data.table(parent=c("Sarah", "Max", "Max"),
name=c("Oliver", "Sebastian", "Michelle"),
sex=c("M", "M", "F"), age=c(5,8,7))
joined = merge(parents, children, by.x="name", by.y="parent")
test(1880.1, length(names(joined)), length(unique(names(joined))))
joined = merge(parents, children, by.x="name", by.y="parent")
test(1880.1, length(names(joined)), length(unique(names(joined))))
test(1880.2, nrow(merge(parents, children, by.x="name", by.y="parent", suffixes=c("",""))), 3L,
warning = "column names.*are duplicated in the result")

# out-of-sample quote rule bump, #2265
DT = data.table(A=rep("abc", 10000), B="def")
DT[110, A:='"a"b']
fwrite(DT, f<-tempfile(), quote=FALSE)
test(1881.1, ans<-fread(f), DT, warning='Found and resolved improper quoting. First healed line 111: <<"a"b,def>>')
test(1881.2, ans[110,A], '"a"b') # double-check the value of interest directly
unlink(f)


###################################
# Add new tests above this line #
Expand Down
43 changes: 28 additions & 15 deletions src/fread.c
Expand Up @@ -1872,7 +1872,7 @@ int freadMain(freadMainArgs _args) {
//*********************************************************************************************
// [11] Read the data
//*********************************************************************************************
bool stopTeam=false, firstTime=true; // bool for MT-safey (cannot ever read half written bool value)
bool stopTeam=false, firstTime=true, restartTeam=false; // bool for MT-safey (cannot ever read half written bool value badly)
int nTypeBump=0, nTypeBumpCols=0;
double tRead=0, tReread=0;
double thRead=0, thPush=0; // reductions of timings within the parallel region
Expand All @@ -1883,6 +1883,8 @@ int freadMain(freadMainArgs _args) {
size_t DTi = 0; // the current row number in DT that we are writing to
const char *headPos = pos; // the jump start corresponding to DTi
int nSwept = 0; // count the number of dirty jumps that were swept
const char *quoteRuleBumpedCh = NULL; // in the very rare event of an out-of-sample quote rule bump, give a good warning message
size_t quoteRuleBumpedLine = -1;
int buffGrown=0;
// chunkBytes is the distance between each jump point; it decides the number of jumps
// We may want each chunk to write to its own page of the final column, hence 1000*maxLen
Expand Down Expand Up @@ -1920,6 +1922,7 @@ int freadMain(freadMainArgs _args) {

if (verbose) DTPRINT("[11] Read the data\n");
read: // we'll return here to reread any columns with out-of-sample type exceptions, or dirty jumps
restartTeam = false;
if (verbose) DTPRINT(" jumps=[%d..%d), chunk_size=%llu, total_size=%llu\n",
jump0, nJumps, (llu)chunkBytes, (llu)(eof-pos));
ASSERT(allocnrow <= nrowLimit, "allocnrow(%llu) <= nrowLimit(%llu)", (llu)allocnrow, (llu)nrowLimit);
Expand All @@ -1939,7 +1942,7 @@ int freadMain(freadMainArgs _args) {
}
size_t myNrow = 0; // the number of rows in my chunk
size_t myBuffRows = initialBuffRows; // Upon realloc, myBuffRows will increase to grown capacity
bool myStoppingEarly = false; // true when an empty or too-short or too-long row is encountered when fill=false
bool myStopEarly = false; // true when an empty or too-short line is encountered when fill=false, or too-long row

// Allocate thread-private row-major `myBuff`s
ThreadLocalFreadParsingContext ctx = {
Expand Down Expand Up @@ -2122,7 +2125,7 @@ int freadMain(freadMainArgs _args) {

if (thisType != joldType) { // rare out-of-sample type exception.
if (!checkedNumberOfFields && !fill) {
// check this line has the correct number of fields. If not, don't apply the bump from this invalid line. Instead fall through to myStoppingEarly below.
// check this line has the correct number of fields. If not, don't apply the bump from this invalid line. Instead fall through to myStopEarly below.
const char *tt = fieldStart;
int fieldsRemaining = countfields(&tt);
if (j+fieldsRemaining != ncol) break;
Expand Down Expand Up @@ -2157,10 +2160,9 @@ int freadMain(freadMainArgs _args) {
break;
}
if (j<ncol || (!eol(&tch) && *tch!='\0')) {
// Too few or too many columns observed (including empty line). If fill==true, fields should already have been filled
// above due to continue inside while(j<ncol)
// We will push rows read so far and then warn we stopped early.
myStoppingEarly = true;
// Too few or too many columns observed (including empty line). If fill==true, fields should already have been
// filled above due to continue inside while(j<ncol).
myStopEarly = true;
tch = tLineStart;
break;
}
Expand Down Expand Up @@ -2193,20 +2195,29 @@ int freadMain(freadMainArgs _args) {
extraAllocRows = (size_t)((double)(DTi+myNrow)*nJumps/(jump+1) * 1.2) - allocnrow;
if (extraAllocRows < 1024) extraAllocRows = 1024;
myNrow = 0; // discard my buffer even though it was read correctly; this one jump will be reread wastefully in this rare case
stopTeam = true;
stopTeam = restartTeam = true;
jump0 = jump;
}
else {
} else {
// tell next thread 2 things :
headPos = tch; // i) advance headPos; the jump start up to which all rows have been pushed
DTi += myNrow; // ii) which row in the final result next thread should start writing to since now I know myNrow.
ctx.nRows = myNrow;
orderBuffer(&ctx);
if (myStoppingEarly) {
if (myStopEarly) {
if (quoteRule<3) {
quoteRule++;
if (quoteRuleBumpedCh == NULL) {
// for warning message if the quote rule bump does in fact manage to heal it, e.g. test 1881
quoteRuleBumpedCh = tLineStart;
quoteRuleBumpedLine = row1line+DTi;
}
restartTeam = true;
jump0 = jump; // this jump will restart from headPos, not from its beginning, e.g. test 1453
}
stopTeam = true;
} else if (headPos>nextJumpStart) {
nSwept++; // next jump landed awkwardly and will be reread from headPos; i.e. next jump is dirty and will be swept
stopTeam = true;
stopTeam = restartTeam = true;
jump0 = jump+1; // restart team from next jump. jump0 always starts from headPos
// if too many jumps are dirty, scale down to single-threaded to save restarting team too much, wastefully. The file requires
// a single threaded read anyway due to its tortuous complexity with so many embedded newlines so often
Expand Down Expand Up @@ -2251,7 +2262,8 @@ int freadMain(freadMainArgs _args) {
extraAllocRows = 0;
goto read;
}
if (jump0>0 && nSwept>0) {
if (restartTeam) {
ASSERT(nSwept>0 || quoteRuleBumpedCh!=NULL, "Internal error: team restart but nSwept==%d and quoteRuleBumpedCh==%p", nSwept, quoteRuleBumpedCh);
goto read;
}
// else nrowLimit applied and stopped early normally
Expand Down Expand Up @@ -2295,9 +2307,8 @@ int freadMain(freadMainArgs _args) {
// reread from the beginning
DTi = 0;
headPos = pos;
jump0 = 0;
firstTime = false;
nTypeBump = 0; // for test 1328.1. Otherwise the last field would get shifted forwards again.
jump0 = 0; // for #2486
nSwept = 0;
goto read;
}
Expand Down Expand Up @@ -2341,6 +2352,8 @@ int freadMain(freadMainArgs _args) {
DTi+row1line, ncol, tt, strlim(skippedFooter,500));
}
}
} else if (quoteRuleBumpedCh!=NULL) {
DTWARN("Found and resolved improper quoting. First healed line %d: <<%s>>", quoteRuleBumpedLine, strlim(quoteRuleBumpedCh, 500));
}

if (verbose) {
Expand Down

0 comments on commit c5b5613

Please sign in to comment.