Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle 1.#INF, 1.#IND, 1.#QNAN and 1.#SNAN #1788

Closed
wants to merge 1 commit into from

Conversation

j0r1
Copy link

@j0r1 j0r1 commented Aug 2, 2016

When a CSV file was created using a program compiled with the Visual Studio compiler, instead of inf and nan strings like 1.#INF, 1.#IND, 1.#QNAN and 1.#SNAN will be written. This patch is intended to be able to handle these strings as well, otherwise entire columns will be interpreted as text instead of numbers.

Since extra checks are done for each double, the modified code is slightly slower. Using a very large CSV file of roughly 2GB, containing only floating point numbers, the new code read it in 35.1 seconds, whereas the unmodified version got 34.3 seconds. This indicates a slowdown of 2.2%

@jangorecki
Copy link
Member

Interesting PR. If it makes a slow down, even of 2%, I would make it optional. Could be even an option "datatable.fread.nonfinite" etc.
If you don't care about NaN and it is fine to read it as NA then you should be able to use na.strings=c("","NA","1.#IND","1.#QNAN","1.#SNAN") argument in fread.
Regarding Inf, it makes more sense to have inf.strings argument. The best would be to have it discussed, so if you could please open an issue where we could discuss that details before you start to incorporate any feedback.

@j0r1
Copy link
Author

j0r1 commented Aug 4, 2016

Thanks for the feedback, I've opened an issue

@codecov-io
Copy link

codecov-io commented Sep 10, 2016

Current coverage is 90.26% (diff: 88.63%)

Merging #1788 into master will decrease coverage by 0.30%

@@             master      #1788   diff @@
==========================================
  Files            58         59     +1   
  Lines         10714      10750    +36   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
  Hits           9704       9704          
- Misses         1010       1046    +36   
  Partials          0          0          

Powered by Codecov. Last update f23c593...bde8d95

@j0r1
Copy link
Author

j0r1 commented Sep 10, 2016

I've modified the code somewhat so it's now optional. To avoid if-tests at a low-level in the code, which would slow things down, I've moved some code to a templace C file, which is then included into fread.c. It is actually included two times, and by setting a few defines before each include, two versions of the readfile function are created: one is the default, as it is in the master branch, and the other will perform the extra checks to handle things like 1.#INF..

In the R implementation of fread, the user can now specify a flag to indicate which readfile function should be used.

@MichaelChirico
Copy link
Member

MichaelChirico commented Sep 10, 2016

@j0r1 be sure to squash your commits; also check the Travis log since something is awry. Thanks for the PR!

@j0r1
Copy link
Author

j0r1 commented Sep 11, 2016

@MichaelChirico Can I still squash commits inside this PR? Or should I just close this, squash them and open a new PR?

The travis log is ok now, but two other checks are failing: I'm not sure why but I'm guessing because too much changes were detected? To make this feature selectable by a flag in the fread parameters, I've moved some code to a different file, which is then included two times, after setting some defines. This was the only way I could think of to do this that doesn't cause performance to go down in the default case (and that doesn't involve copy-pasting of a large part of the code).

@MichaelChirico
Copy link
Member

Yes, if you squash and push --force back onto the same branch in your fork, it will automatically update here.

I've never seen codecov fail before... not sure what to tell you there

@j0r1
Copy link
Author

j0r1 commented Sep 11, 2016

Thanks for the tip! Now it should be squashed into a single commit.

Is the codecov failure a fundamental problem?

When using the Visual Studio compiler, text representations
1.#INF, 1.#IND, 1.#QNAN and 1.#SNAN are used instead of 'inf' and
'NaN'. This patch recognizes them if a parameter for fread
(R version) is set to TRUE.

In the C code, the functions Strtod and readfile were moved to
fread_readfile_template.h. Some functions were renamed, e.g.
Strtod was renamed to TEMPLATE_Strtod. By setting defines in
fread.c and including fread_readfile_template.h, two versions
of the readfile function are created:

- one uses the default code
- the other uses strtod_wrapper and strtold_wrapper, which add
extra checks to handle e.g. 1.#INF and 1.#IND

Depending on the flag vs.inf.nan in the R fread function, one of
the two C functions is called.

This way, the original behaviour is still the default, and runs
without any performance penalty. If needed, the user can activate
the slightly slower modified code which performs extra checks.
@mattdowle
Copy link
Member

Thanks for the PR and really sorry for not keeping up at the time, a year ago now.

fread.c has gone through a lot of change recently (parallelized and bare/full readers) so merging the PR with current master would be daunting, plus a newer automatic approach is possible in the new code structure. The issue #1800 you filed is safely on the fread master list #2247.

@mattdowle mattdowle closed this Aug 7, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants