Better jump sync and run-on #2627

mattdowle · 2018-02-14T03:51:42Z

Closes #2267
Closes #2561

nextGoodLine() simplified to always return a good guess rather than potentially fail with false. This removes the fail points such as "No good line" and "did not finish exactly".
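To illustrate the "always return a guess" idea, here is a minimal sketch. It is not the code in fread.c: `guess_line_start`, `count_fields`, `CHECK_LINES` and `MAX_CANDIDATES` are hypothetical names, and quoting is ignored for brevity. It tries a few candidate line starts after the jump point and, if none of them is followed by lines with the expected number of fields, falls back to the byte after the first newline instead of failing.

```c
#include <stddef.h>

/* Hypothetical tuning constants, chosen only for this sketch. */
enum { CHECK_LINES = 3, MAX_CANDIDATES = 10 };

/* Count sep-delimited fields on the line starting at p and set *next to the
   start of the following line (or to end). Quotes are ignored here. */
static int count_fields(const char *p, const char *end, char sep,
                        const char **next) {
    int n = 1;
    while (p < end && *p != '\n') {
        if (*p == sep) n++;
        p++;
    }
    *next = (p < end) ? p + 1 : end;
    return n;
}

/* Return a guess for the first line start after `start`. Never fails:
   if no candidate looks consistent, the first candidate is returned. */
const char *guess_line_start(const char *start, const char *end, char sep,
                             int ncol) {
    /* Fallback guess: the byte after the first newline. */
    const char *first = start;
    while (first < end && *first != '\n') first++;
    if (first < end) first++;

    const char *cand = first;
    for (int k = 0; k < MAX_CANDIDATES && cand < end; k++) {
        const char *p = cand, *next = cand;
        int ok = 1;
        /* Accept the candidate if the next few lines all have ncol fields. */
        for (int i = 0; i < CHECK_LINES && p < end; i++) {
            if (count_fields(p, end, sep, &next) != ncol) { ok = 0; break; }
            p = next;
        }
        if (ok) return cand;
        /* Otherwise move the candidate to the start of the next line. */
        count_fields(cand, end, sep, &next);
        cand = next;
    }
    return first;
}
```

Because the result is only a guess, the caller still has to verify it later, which is what the headPos check described further down is for.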

Only the first sample of 100 lines can now bump the quoteRule. Later sample jumps skip rather than bump quoteRule, because it's more likely that the sample jump point landed inside a quoted field. The thread at headPos when reading (which has full lineage from the start of file, sof) will bump the quoteRule with certainty then, if needed. This prepares for #2265.
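The sampling policy described above can be sketched as a small decision function. This is an illustration only: `after_sample`, `SampleAction` and `MAX_QUOTE_RULE` are hypothetical names, and the assumption that quote rules are numbered 0 to 3 is made just for this sketch.

```c
#include <stdbool.h>

enum { MAX_QUOTE_RULE = 3 }; /* assumption: rules are numbered 0..3 */

typedef enum { SAMPLE_OK, SAMPLE_RETRY_BUMPED, SAMPLE_SKIPPED } SampleAction;

/* Decide what to do after sampling lines at jump `jump`.
   `parsed_ok` says whether the sampled lines parsed under *quoteRule. */
SampleAction after_sample(int jump, bool parsed_ok, int *quoteRule) {
    if (parsed_ok) return SAMPLE_OK;
    if (jump == 0 && *quoteRule < MAX_QUOTE_RULE) {
        /* The first sample starts at a known line start, so a parse
           failure is evidence against the current quote rule. */
        (*quoteRule)++;
        return SAMPLE_RETRY_BUMPED;
    }
    /* Later samples may have landed inside a quoted field, so a failure
       there says little about the quote rule: skip this sample. */
    return SAMPLE_SKIPPED;
}
```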

Dirty jump sweeping implemented. The thread at headPos now knows if it finished after nextJumpStart, which happens when nextJumpStart landed awkwardly inside a quoted field with many newlines, causing nextGoodLine to pick a false newline. In this case, jump0 is set to reread the dirty chunk and the team is restarted. If too many dirty jumps are encountered, a single thread with full lineage from sof will be needed anyway, so the team is downgraded to single-threaded from that point on. In most cases of a few dirty jumps, they will be swept by the thread at headPos and the remainder of the file will continue to be read in parallel.
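The sweeping logic can be sketched as a sequential simulation of the ordered section. This is a simplified model under stated assumptions, not the fread.c implementation: `read_with_sweep`, `SweepResult`, `naive_start`, `skip_line` and the `max_dirty` threshold are hypothetical, there is no real threading, and the jump-point guess is deliberately naive (the byte after the first newline). The point it shows is the comparison between where the trusted reader actually finished (headPos) and where the next jump guessed its start.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    size_t rows;          /* rows read with full lineage */
    int dirty;            /* jumps whose guessed start was wrong */
    bool single_threaded; /* set once too many dirty jumps were seen */
} SweepResult;

/* Skip one line, treating newlines inside "..." as data. */
static const char *skip_line(const char *p, const char *end) {
    bool quoted = false;
    while (p < end) {
        char c = *p++;
        if (c == '"') quoted = !quoted;
        else if (c == '\n' && !quoted) break;
    }
    return p;
}

/* Naive jump guess: the byte after the first '\n' at or after pos.
   It cannot tell whether that newline sits inside a quoted field. */
static const char *naive_start(const char *pos, const char *end) {
    while (pos < end && *pos != '\n') pos++;
    return (pos < end) ? pos + 1 : end;
}

SweepResult read_with_sweep(const char *buf, size_t len, size_t chunk,
                            int max_dirty) {
    assert(chunk > 0);
    SweepResult r = {0, 0, false};
    const char *end = buf + len;
    const char *headPos = buf; /* trusted position, lineage from sof */
    size_t njumps = (len + chunk - 1) / chunk;
    for (size_t j = 0; j < njumps; j++) {
        const char *start = (j == 0) ? buf : naive_start(buf + j * chunk, end);
        const char *next = (j + 1 == njumps)
                               ? end
                               : naive_start(buf + (j + 1) * chunk, end);
        if (start != headPos) {
            /* The previous chunk finished after this jump's guessed start:
               the jump is dirty and is reread from headPos instead. */
            r.dirty++;
            if (r.dirty > max_dirty) {
                r.single_threaded = true; /* give up on parallel reading */
                next = end;
            }
        }
        const char *p = headPos;
        while (p < next) {
            p = skip_line(p, end);
            r.rows++;
        }
        headPos = p;
        if (r.single_threaded) break;
    }
    return r;
}
```

In this model a quoted field that spans a chunk boundary makes the next jump's guess land on a false newline, so the reader from headPos overshoots it and the jump is counted as dirty. With a small `max_dirty` the sketch switches to reading the rest of the buffer in one pass.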

Simulate bad_fill.csv from #2267 (fread fails on file with inconsistent # columns). At 35MB it's too big for CRAN but works now.
Simulate issue682.txt from #2561 (Jump 0 did not finish counting rows exactly where jump 1 found its first good line start). At 2.1MB it's too big for CRAN but works now.
Simulate bad jump point by constructing an embedded field with 5 newlines.
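As a rough illustration of the last item, the following sketch builds a small CSV buffer in which one quoted field holds 5 embedded newlines. The real tests are written in R in the package test suite; `make_embedded_newline_csv` and `count_newlines` are hypothetical helpers used only here.

```c
#include <stddef.h>
#include <stdio.h>

/* Write nrows rows of "i,text\n" into out. Row `bad_row` gets a quoted
   field containing 5 embedded newlines. Returns the number of bytes
   written (excluding the terminating NUL), or 0 if out is too small. */
size_t make_embedded_newline_csv(char *out, size_t cap, int nrows,
                                 int bad_row) {
    size_t used = 0;
    for (int i = 0; i < nrows; i++) {
        const char *field =
            (i == bad_row) ? "\"a\nb\nc\nd\ne\nf\"" : "plain";
        int n = snprintf(out + used, cap - used, "%d,%s\n", i, field);
        if (n < 0 || (size_t)n >= cap - used) return 0;
        used += (size_t)n;
    }
    return used;
}

/* Count raw '\n' bytes. A scanner that ignores quoting would treat
   every one of them as the end of a row. */
size_t count_newlines(const char *s, size_t len) {
    size_t n = 0;
    for (size_t i = 0; i < len; i++)
        if (s[i] == '\n') n++;
    return n;
}
```

With 4 rows and the embedded field in row 2, the buffer holds more raw newlines than rows, so a jump point that lands inside the quoted field can pick any of the 5 false newlines as a line start.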

codecov-io · 2018-02-14T04:20:44Z

Codecov Report

Merging #2627 into master will increase coverage by 0.07%.
The diff coverage is 95.65%.

@@            Coverage Diff             @@
##           master    #2627      +/-   ##
==========================================
+ Coverage   92.95%   93.02%   +0.07%     
==========================================
  Files          61       61              
  Lines       12129    12123       -6     
==========================================
+ Hits        11275    11278       +3     
+ Misses        854      845       -9

| Impacted Files | Coverage Δ | |
|---|---|---|
| R/print.data.table.R | `98.13% <100%> (ø)` | ⬆️ |
| src/fread.c | `97.11% <95.6%> (+0.72%)` | ⬆️ |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e5c0eda...3eec5a8. Read the comment docs.

st-pasha

Superb! These "invalid chunks" errors were some of the most annoying issues in fread, and now it seems you solved them.

st-pasha · 2018-02-16T00:15:00Z

NEWS.md

@@ -28,7 +28,7 @@
    * Single-column input with blank lines is now valid and the blank lines are significant (meaning an NA in the single column). The blank lines are significant even at the very end, which may be surprising on first glance. The change is so that `fread(fwrite(DT))==DT` for single-column inputs containing NA which are written as blank. There is no change when `ncol>1` (i.e., input stops with detailed warning at the first blank line) because a blank line when `ncol>1` is invalid input due to no separators present instead of `ncol-1` separators.
    * Too few column names are now auto filled with default column names, with warning, [#1625](https://github.com/Rdatatable/data.table/issues/1625). If there is just one missing column name it is guessed to be for the first column (row names or an index), otherwise the column names are filled at the end. Similarly, too many column names now automatically sets `fill=TRUE`, with warning.
    * `skip=` and `nrow=` are more reliable and no longer affected by invalid lines outside the range specified. Thanks to Ziyad Saeed and Kyle Chung for reporting, [#1267](https://github.com/Rdatatable/data.table/issues/1267). Tests added.
-    * Many thanks to @yaakovfeldman, Guillermo Ponce, Arun Srinivasan, Hugh Parsonage, Mark Klik, Pasha Stetsenko, Mahyar K, Tom Crockett, @cnoelke, @qinjs, @etienne-s, Mark Danese, Avraham Adler, @franknarf1, @MichaelChirico, @tdhock, Luke Tierney for testing before release to CRAN: [#2070](https://github.com/Rdatatable/data.table/issues/2070), [#2073](https://github.com/Rdatatable/data.table/issues/2073), [#2087](https://github.com/Rdatatable/data.table/issues/2087), [#2091](https://github.com/Rdatatable/data.table/issues/2091), [#2107](https://github.com/Rdatatable/data.table/issues/2107), [fst#50](https://github.com/fstpackage/fst/issues/50#issuecomment-294287846), [#2118](https://github.com/Rdatatable/data.table/issues/2118), [#2092](https://github.com/Rdatatable/data.table/issues/2092), [#1888](https://github.com/Rdatatable/data.table/issues/1888), [#2123](https://github.com/Rdatatable/data.table/issues/2123), [#2167](https://github.com/Rdatatable/data.table/issues/2167), [#2194](https://github.com/Rdatatable/data.table/issues/2194), [#2238](https://github.com/Rdatatable/data.table/issues/2238), [#2228](https://github.com/Rdatatable/data.table/issues/2228), [#1464](https://github.com/Rdatatable/data.table/issues/1464), [#2201](https://github.com/Rdatatable/data.table/issues/2201), [#2287](https://github.com/Rdatatable/data.table/issues/2287), [#2299](https://github.com/Rdatatable/data.table/issues/2299), [#2285](https://github.com/Rdatatable/data.table/issues/2285), [#2251](https://github.com/Rdatatable/data.table/issues/2251), [#2347](https://github.com/Rdatatable/data.table/issues/2347), [#2222](https://github.com/Rdatatable/data.table/issues/2222), [#2352](https://github.com/Rdatatable/data.table/issues/2352), [#2246](https://github.com/Rdatatable/data.table/issues/2246), [#2370](https://github.com/Rdatatable/data.table/issues/2370), [#2371](https://github.com/Rdatatable/data.table/issues/2371), 
[#2404](https://github.com/Rdatatable/data.table/issues/2404), [#2196](https://github.com/Rdatatable/data.table/issues/2196), [#2322](https://github.com/Rdatatable/data.table/issues/2322), [#2453](https://github.com/Rdatatable/data.table/issues/2453), [#2446](https://github.com/Rdatatable/data.table/issues/2446), [#2464](https://github.com/Rdatatable/data.table/issues/2464), [#2457](https://github.com/Rdatatable/data.table/issues/2457), [#1895](https://github.com/Rdatatable/data.table/issues/1895), [#2481](https://github.com/Rdatatable/data.table/pull/2481), [#2499](https://github.com/Rdatatable/data.table/issues/2499), [#2516](https://github.com/Rdatatable/data.table/issues/2516), [#2520](https://github.com/Rdatatable/data.table/issues/2520), [#2512](https://github.com/Rdatatable/data.table/issues/2512), [#2523](https://github.com/Rdatatable/data.table/issues/2523), [#2542](https://github.com/Rdatatable/data.table/issues/2542), [#2526](https://github.com/Rdatatable/data.table/issues/2526), [#2518](https://github.com/Rdatatable/data.table/issues/2518), [#2515](https://github.com/Rdatatable/data.table/issues/2515), [#1671](https://github.com/Rdatatable/data.table/issues/1671)
+    * Many thanks to @yaakovfeldman, Guillermo Ponce, Arun Srinivasan, Hugh Parsonage, Mark Klik, Pasha Stetsenko, Mahyar K, Tom Crockett, @cnoelke, @qinjs, @etienne-s, Mark Danese, Avraham Adler, @franknarf1, @MichaelChirico, @tdhock, Luke Tierney for testing before release to CRAN: [#2070](https://github.com/Rdatatable/data.table/issues/2070), [#2073](https://github.com/Rdatatable/data.table/issues/2073), [#2087](https://github.com/Rdatatable/data.table/issues/2087), [#2091](https://github.com/Rdatatable/data.table/issues/2091), [#2107](https://github.com/Rdatatable/data.table/issues/2107), [fst#50](https://github.com/fstpackage/fst/issues/50#issuecomment-294287846), [#2118](https://github.com/Rdatatable/data.table/issues/2118), [#2092](https://github.com/Rdatatable/data.table/issues/2092), [#1888](https://github.com/Rdatatable/data.table/issues/1888), [#2123](https://github.com/Rdatatable/data.table/issues/2123), [#2167](https://github.com/Rdatatable/data.table/issues/2167), [#2194](https://github.com/Rdatatable/data.table/issues/2194), [#2238](https://github.com/Rdatatable/data.table/issues/2238), [#2228](https://github.com/Rdatatable/data.table/issues/2228), [#1464](https://github.com/Rdatatable/data.table/issues/1464), [#2201](https://github.com/Rdatatable/data.table/issues/2201), [#2287](https://github.com/Rdatatable/data.table/issues/2287), [#2299](https://github.com/Rdatatable/data.table/issues/2299), [#2285](https://github.com/Rdatatable/data.table/issues/2285), [#2251](https://github.com/Rdatatable/data.table/issues/2251), [#2347](https://github.com/Rdatatable/data.table/issues/2347), [#2222](https://github.com/Rdatatable/data.table/issues/2222), [#2352](https://github.com/Rdatatable/data.table/issues/2352), [#2246](https://github.com/Rdatatable/data.table/issues/2246), [#2370](https://github.com/Rdatatable/data.table/issues/2370), [#2371](https://github.com/Rdatatable/data.table/issues/2371), 
[#2404](https://github.com/Rdatatable/data.table/issues/2404), [#2196](https://github.com/Rdatatable/data.table/issues/2196), [#2322](https://github.com/Rdatatable/data.table/issues/2322), [#2453](https://github.com/Rdatatable/data.table/issues/2453), [#2446](https://github.com/Rdatatable/data.table/issues/2446), [#2464](https://github.com/Rdatatable/data.table/issues/2464), [#2457](https://github.com/Rdatatable/data.table/issues/2457), [#1895](https://github.com/Rdatatable/data.table/issues/1895), [#2481](https://github.com/Rdatatable/data.table/pull/2481), [#2499](https://github.com/Rdatatable/data.table/issues/2499), [#2516](https://github.com/Rdatatable/data.table/issues/2516), [#2520](https://github.com/Rdatatable/data.table/issues/2520), [#2512](https://github.com/Rdatatable/data.table/issues/2512), [#2523](https://github.com/Rdatatable/data.table/issues/2523), [#2542](https://github.com/Rdatatable/data.table/issues/2542), [#2526](https://github.com/Rdatatable/data.table/issues/2526), [#2518](https://github.com/Rdatatable/data.table/issues/2518), [#2515](https://github.com/Rdatatable/data.table/issues/2515), [#1671](https://github.com/Rdatatable/data.table/issues/1671), [#2267](https://github.com/Rdatatable/data.table/issues/2267), [#2561](https://github.com/Rdatatable/data.table/issues/2561)


Could we perhaps split this into multiple lines (one per issue)? It has already grown so big that it's impossible to know what was added/deleted, and it will probably get even bigger in the future...

mattdowle · 2018-02-16

Could do. I'm not quite sure how to split it though. The idea of this item in NEWS is merely to convey to lay readers of NEWS on release to CRAN that lots of people have helped to find and fix lots of problems. I'm not expecting anyone at all to click through them all. If we want to know what's been done and when in more detail, we'd use the milestone tag instead. I'm only adding to the end, so the diff is only ever at the end (or adding new people to the end of the list at the beginning of this item). Also, many of these items are fixes to new problems that have been created by going parallel in dev. NEWS is only really supposed to cover user changes from the last released version to CRAN.

HughParsonage · 2018-02-16

You could just put every pull request link on a new line. I believe markdown ignores single new lines (like TeX).

MichaelChirico · 2018-02-16

I'm fine with it as it is, but it could be worth writing a short blog-post-type document with the release to highlight all the work that has been done, and referring to it in NEWS ("see [this]() post for more").

mattdowle added 2 commits February 13, 2018 18:59

Interim. Passing tests.

15732b8

Tidy

8867fd9

mattdowle added this to the v1.10.6 milestone Feb 14, 2018

mattdowle added 10 commits February 13, 2018 20:51

Added test for #2561

2ca20ae

Tidied test

fcc7c23

Added test for out-of-sample short lines in the first jump

bb539b3

Added headPos. Passing tests. Interim.

7f3f5c3

Added test for #2267

7516eba

Tidied ordered section

5d8030e

Merge branch 'master' into jumpsync

654cd0a

Dirty jump sweeping implemented and test added.

6de7e00

Change compactprint() to show 5 rows, to reveal data difference on Windows fail in AppVeyor log

c03189c

Fixed new test for Windows

3eec5a8

mattdowle requested a review from st-pasha February 15, 2018 22:52

st-pasha approved these changes Feb 16, 2018

st-pasha reviewed Feb 16, 2018

mattdowle merged commit 2e83d3c into master Feb 16, 2018

mattdowle deleted the jumpsync branch February 16, 2018 00:58

st-pasha mentioned this pull request Mar 4, 2018

Line numbers in error messages should refer to the raw file #2428

Open
