fread may fail to parse valid file when dec=',' #2750

Closed
st-pasha opened this issue Apr 13, 2018 · 5 comments

Comments

@st-pasha
Contributor

st-pasha commented Apr 13, 2018

> fread('a,b,c,d\n1e1,1e2,1e3,"4,0001"\n1,2,3,4\n', dec=',')
       a     b     c      d
   <num> <num> <num>  <num>
1:    10   100  1000 4.0001
Warning message:
In fread("a,b,c,d\n1e1,1e2,1e3,\"4,0001\"\n1,2,3,4\n", dec = ",") :
  Discarded single-line footer: <<1,2,3,4>>

This happens because the float parser greedily consumes 1,2 as a single numeric token (treating the comma as the decimal mark), whereas an unquoted comma is a field separator, so 1,2 must be parsed as two separate fields.
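To make the ambiguity concrete, here is a minimal base R sketch. It is a toy model only, not data.table's actual parser, and `greedy_numeric_prefix()` is a hypothetical helper introduced for illustration. It contrasts a greedy numeric scan with a separator-first split on the same unquoted line.

```r
# Toy model of the behaviour described above; NOT data.table's actual parser.
# greedy_numeric_prefix() is a hypothetical helper: it returns the longest
# prefix of `line` that looks like a number when `dec` is the decimal mark.
greedy_numeric_prefix <- function(line, dec = ",") {
  # digits, optional decimal mark plus digits, optional exponent
  pattern <- paste0("^[0-9]+([", dec, "][0-9]+)?([eE][+-]?[0-9]+)?")
  regmatches(line, regexpr(pattern, line))
}

line <- "1,2,3,4"

# Greedy numeric scan: the first comma is taken as a decimal mark.
greedy_token <- greedy_numeric_prefix(line, dec = ",")

# Separator-first scan: an unquoted comma always ends the field.
fields <- strsplit(line, ",", fixed = TRUE)[[1]]

print(greedy_token)
print(fields)
```

In this sketch the greedy scan returns a single token spanning two fields, while the separator-first split yields four fields, which is what the header row implies.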

In addition, the "Details" section of the documentation contains the following information, which has long been outdated:

‘fread’ uses C function ‘strtod’ to read numeric data; e.g.,
‘1.23’ or ‘1,23’. ‘strtod’ retrieves the decimal separator (‘.’ or
‘,’ usually) from the locale of the R session rather than as an
argument passed to the ‘strtod’ function. So for
‘fread(...,dec=",")’ to work, ‘fread’ changes this (and only this)
R session's locale temporarily to a locale which provides the
desired decimal separator.

@Atrebas

Atrebas commented Apr 14, 2018

Awesome package! Thank you.
Coincidentally, I experienced a possibly related strange behaviour with fread yesterday.
Here is a simple example (derived from a real case that called fread via lapply on a bunch of files).

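library(data.table)

# Write 10,000 rows of "20,1" unquoted, so each line contains a decimal comma.
DT = data.table(A = rep("20,1", 1e4))
fwrite(DT, "DT.csv", quote = FALSE)

# Record the class of column A on each of 1,000 repeated reads.
classA = character(1e3)

for (i in seq_along(classA)) {
  # sep = ";" never occurs in the file, so each line is a single field.
  DT = fread("DT.csv", sep = ";", dec = ",", colClasses = "numeric")
  classA[i] = DT[, class(A)]
}

table(classA)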

This gives me:

Warning message:
In fread("DT.csv", sep = ";", dec = ",", colClasses = "numeric",  :
  Bumped column 1 to type character on data row 387, field contains '20,1'. Coercing previously read values in this column from logical, integer or numeric back to character which may not be lossless; e.g., if '00' and '000' occurred before they will now be just '0', and there may be inconsistencies with treatment of ',,' and ',NA,' too (if they occurred in this column before the bump). If this matters please rerun and set 'colClasses' to 'character' for this column. Please note that column type detection uses a sample of 1,000 rows (100 rows at 10 points) so hopefully this message should be very rare. If reporting to datatable-help, please rerun and include the output from verbose=TRUE.

table(classA)
# classA
# character   numeric 
#        1       999 
which(classA == "character")
# [1] 9

So, sometimes, the column is read as character, and the row at which the bump is reported differs between runs. The which() output suggests it is more likely to happen in the first iterations. I don't understand the randomness, nor why colClasses is ignored (I set colClasses after reading the fread doc...).
And... I did not manage to reproduce the error when setting verbose = TRUE... Also, the bug was observed in RStudio but not reproduced in the R console...
Sorry if I missed or misunderstood something...
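The warning above suggests rerunning with colClasses set to character. One way to follow that advice is to read the column as character and convert it afterwards. The base R sketch below assumes that approach; `comma_to_numeric()` is a hypothetical helper, not part of data.table, and it only handles a single decimal comma per value.

```r
# Hypothetical helper (not part of data.table): convert strings that use a
# decimal comma, such as "20,1", to numeric values.
comma_to_numeric <- function(x) {
  # Replace the first comma with a dot, then parse with as.numeric(),
  # which always expects "." as the decimal mark.
  as.numeric(sub(",", ".", x, fixed = TRUE))
}

values <- comma_to_numeric(c("20,1", "3,5", "7"))
print(values)
```

This sidesteps type detection entirely, at the cost of an extra conversion pass over the column.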

sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.10.4-3

loaded via a namespace (and not attached):
[1] tools_3.3.2 yaml_2.1.18

@MichaelChirico
Member

MichaelChirico commented Apr 14, 2018 via email

@Atrebas

Atrebas commented Apr 14, 2018

Indeed... I ran the same example a couple of times with 1.10.5 and it worked fine.
Thank you.

@ben-schwen
Member

Confirming that this was fixed with #4495

@MichaelChirico
Member

thanks for checking!
