Handle 1.#INF, 1.#IND, 1.#QNAN and 1.#SNAN #1788

j0r1 · 2016-08-02T07:37:47Z

When a CSV file is created by a program compiled with the Visual Studio compiler, strings like 1.#INF, 1.#IND, 1.#QNAN and 1.#SNAN are written instead of inf and nan. This patch handles these strings as well; otherwise entire columns are interpreted as text instead of numbers.
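
As an illustration of the kind of extra check involved, here is a minimal sketch of a wrapper around the standard strtod. The name strtod_msvc and its structure are assumptions made for this example, not the code in the patch. It accepts an optional sign followed by one of the four Visual Studio spellings and otherwise defers to strtod. Unlike strtod, it does not skip leading whitespace before the special spellings.

```c
#include <math.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not the PR's actual code: parse a double, also
   accepting MSVC-style non-finite spellings such as 1.#INF and -1.#IND. */
static double strtod_msvc(const char *s, char **end)
{
    static const struct { const char *text; int is_inf; } specials[] = {
        { "1.#INF", 1 }, { "1.#IND", 0 }, { "1.#QNAN", 0 }, { "1.#SNAN", 0 },
    };
    const char *p = s;
    int negative = 0;

    /* Optional sign in front of the special spelling. */
    if (*p == '+' || *p == '-') {
        negative = (*p == '-');
        p++;
    }

    for (size_t i = 0; i < sizeof specials / sizeof specials[0]; i++) {
        size_t n = strlen(specials[i].text);
        if (strncmp(p, specials[i].text, n) == 0) {
            if (end) *end = (char *)(p + n);   /* first char after the match */
            double v = specials[i].is_inf ? INFINITY : NAN;
            return negative ? -v : v;
        }
    }

    /* Not a special spelling: fall back to the standard parser. */
    return strtod(s, end);
}
```

Plain strtod stops at the '#' in 1.#INF and leaves the rest of the field unparsed, so a reader that expects the whole field to be consumed has to treat it as text. The loop above is also where the per-double cost comes from: every field is compared against the special spellings before strtod runs.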

Since extra checks are done for each double, the modified code is slightly slower. On a very large CSV file of roughly 2GB, containing only floating point numbers, the new code took 35.1 seconds to read it, whereas the unmodified version took 34.3 seconds. This indicates a slowdown of roughly 2.3%.

jangorecki · 2016-08-02T10:50:44Z

Interesting PR. If it causes a slowdown, even of 2%, I would make it optional. It could even be an option such as "datatable.fread.nonfinite".
If you don't care about NaN and it is fine to read it as NA, then you should be able to use the na.strings=c("","NA","1.#IND","1.#QNAN","1.#SNAN") argument in fread.
Regarding Inf, it makes more sense to have an inf.strings argument. It would be best to have this discussed, so could you please open an issue where we can discuss those details before you start incorporating any feedback?

j0r1 · 2016-08-04T17:58:06Z

Thanks for the feedback, I've opened an issue

codecov-io · 2016-09-10T19:09:05Z

Current coverage is 90.26% (diff: 88.63%)

Merging #1788 into master will decrease coverage by 0.30%

@@             master      #1788   diff @@
==========================================
  Files            58         59     +1   
  Lines         10714      10750    +36   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
  Hits           9704       9704          
- Misses         1010       1046    +36   
  Partials          0          0

Powered by Codecov. Last update f23c593...bde8d95

j0r1 · 2016-09-10T19:16:39Z

I've modified the code somewhat so it's now optional. To avoid if-tests at a low level in the code, which would slow things down, I've moved some code to a template C file, which is then included into fread.c. It is actually included twice, and by setting a few defines before each include, two versions of the readfile function are created: one is the default, as it is in the master branch, and the other performs the extra checks to handle things like 1.#INF.

In the R implementation of fread, the user can now specify a flag to indicate which readfile function should be used.
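
The idea of generating two readers from one body can be sketched in a single file. This is only an approximation under stated assumptions: a function-generating macro (DEFINE_READER) stands in for the template file that is included twice, and all names here (parse_with_checks, read_default, read_msvc, read_line) are hypothetical and do not come from the patch.

```c
#include <math.h>
#include <stdlib.h>
#include <string.h>

/* Extra-check parser (hypothetical): only recognizes unsigned "1.#INF". */
static double parse_with_checks(const char *s, char **end)
{
    if (strncmp(s, "1.#INF", 6) == 0) {
        if (end) *end = (char *)(s + 6);
        return INFINITY;
    }
    return strtod(s, end);
}

/* Stands in for the template file: expands to one complete reader bound
   to a specific parser, so no flag is tested inside the loop. */
#define DEFINE_READER(NAME, PARSE)                                        \
    static size_t NAME(const char *line, double *out, size_t max)         \
    {                                                                     \
        size_t n = 0;                                                     \
        while (n < max && *line != '\0') {                                \
            char *end;                                                    \
            double v = PARSE(line, &end);                                 \
            /* stop unless the whole field was consumed as a number */    \
            if (end == line || (*end != ',' && *end != '\0')) break;      \
            out[n++] = v;                                                 \
            if (*end == '\0') break;                                      \
            line = end + 1;                                               \
        }                                                                 \
        return n;                                                         \
    }

DEFINE_READER(read_default, strtod)            /* default behaviour */
DEFINE_READER(read_msvc, parse_with_checks)    /* with the extra checks */

/* The flag is tested once per call, outside the per-field loop. */
static size_t read_line(const char *line, double *out, size_t max, int msvc)
{
    return msvc ? read_msvc(line, out, max) : read_default(line, out, max);
}
```

With the flag set, read_line("1.5,1.#INF,2", out, 8, 1) fills three values. With the flag cleared, the default reader stops at the 1.#INF field after storing only 1.5, and its loop contains no trace of the extra check.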

MichaelChirico · 2016-09-10T22:40:04Z

@j0r1 be sure to squash your commits; also check the Travis log since something is awry. Thanks for the PR!

j0r1 · 2016-09-11T12:19:00Z

@MichaelChirico Can I still squash commits inside this PR? Or should I just close this, squash them and open a new PR?

The Travis log is OK now, but two other checks are failing. I'm not sure why, but I'm guessing it's because too many changes were detected? To make this feature selectable by a flag in the fread parameters, I've moved some code to a different file, which is then included twice, after setting some defines. This was the only way I could think of to do this that doesn't cause performance to go down in the default case (and that doesn't involve copy-pasting a large part of the code).

MichaelChirico · 2016-09-11T13:55:42Z

Yes, if you squash and push --force back onto the same branch in your fork, it will automatically update here.

I've never seen codecov fail before... not sure what to tell you there

j0r1 · 2016-09-11T15:04:05Z

Thanks for the tip! Now it should be squashed into a single commit.

Is the codecov failure a fundamental problem?

When using the Visual Studio compiler, text representations 1.#INF, 1.#IND, 1.#QNAN and 1.#SNAN are used instead of 'inf' and 'NaN'. This patch recognizes them if a parameter for fread (R version) is set to TRUE.

In the C code, the functions Strtod and readfile were moved to fread_readfile_template.h. Some functions were renamed, e.g. Strtod was renamed to TEMPLATE_Strtod. By setting defines in fread.c and including fread_readfile_template.h, two versions of the readfile function are created:

- one uses the default code
- the other uses strtod_wrapper and strtold_wrapper, which add extra checks to handle e.g. 1.#INF and 1.#IND

Depending on the flag vs.inf.nan in the R fread function, one of the two C functions is called. This way, the original behaviour is still the default, and runs without any performance penalty. If needed, the user can activate the slightly slower modified code which performs extra checks.

mattdowle · 2017-08-07T22:14:23Z

Thanks for the PR and really sorry for not keeping up at the time, a year ago now.

fread.c has gone through a lot of change recently (parallelized and bare/full readers) so merging the PR with current master would be daunting, plus a newer automatic approach is possible in the new code structure. The issue #1800 you filed is safely on the fread master list #2247.

j0r1 mentioned this pull request Aug 4, 2016

Handle 1.#INF, 1.#IND, 1.#QNAN and 1.#SNAN #1800

Closed

j0r1 force-pushed the visualstudio_inf_nan branch from 0b25e92 to 681de85 Compare September 11, 2016 14:28

j0r1 force-pushed the visualstudio_inf_nan branch from 681de85 to bde8d95 Compare October 7, 2016 21:23

st-pasha added the fread label Jun 29, 2017

mattdowle closed this Aug 7, 2017
