fread/fwrite *base data types* directly for efficiency #1656

Open
jangorecki opened this issue Apr 19, 2016 · 6 comments

Comments

@jangorecki
Member

jangorecki commented Apr 19, 2016

fwrite and fread could have an option to write/read the unclassed representation of objects. The whole I/O process could then be much faster, because we would not have to format POSIXct, [I]Date and other classes when writing to file: we just unclass them and write the underlying numbers (or whatever base type they use). Later, when reading with fread, we provide colClasses (or yaml), and those numbers (or other base types) are restored to their original class.

library(data.table)
dt = data.table(a=as.Date("2015-01-01"), b=as.POSIXct("2015-01-01 15:30:20"))
# AsIs=TRUE is the proposed argument, not an existing fwrite option
fwrite(dt, "fwrite-tests.csv", AsIs=TRUE)
system("cat fwrite-tests.csv")
#a, b
#16436, 1420106420
# proposed: fread restores the classes from colClasses, no string parsing
fread("fwrite-tests.csv", colClasses = c(a="Date", b="POSIXct"))
#            a                   b
#       <Date>              <POSc>
#1: 2015-01-01 2015-01-01 15:30:20
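
Until such an option exists, the same round trip can be approximated by hand. The following is only a sketch of the idea, not the proposed implementation: it strips the classes before writing and reattaches them after reading. The names `raw`, `back` and `f` are illustrative, and `tz = "UTC"` is an assumption made so that the stored number does not depend on the local time zone.

```r
library(data.table)

dt = data.table(a = as.Date("2015-01-01"),
                b = as.POSIXct("2015-01-01 15:30:20", tz = "UTC"))

# Drop the classes so that only the underlying numbers get written
# (days since 1970-01-01 for Date, seconds since the epoch for POSIXct).
raw = dt[, .(a = as.integer(a), b = as.numeric(b))]

f = tempfile(fileext = ".csv")
fwrite(raw, f)

# Read the plain numbers back and reattach the original classes.
back = fread(f)
back[, a := as.Date(a, origin = "1970-01-01")]
back[, b := as.POSIXct(b, origin = "1970-01-01", tz = "UTC")]
```

The cost is that the caller has to remember which columns were dates and which time zone applies, which is exactly the bookkeeping that colClasses (or a yaml header) would take over.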
@mattdowle mattdowle mentioned this issue Apr 20, 2016
15 tasks
@mattdowle mattdowle changed the title fread could read *base data types* directly for efficiency fread/fwrite *base data types* directly for efficiency Apr 20, 2016
@clarkdk

clarkdk commented May 10, 2016

It would be awesome if fwrite by default wrote Date and POSIXct objects in standard formats (i.e. ISO 8601, or lubridate) that fread would automatically recognize and read in as Date or POSIXct, without requiring colClasses to be specified for each datetime column. A default automatic approach would be really welcome, since I spend a lot of programming time converting datetime character strings back to POSIXct when importing time series data, even when I have written the file myself with write.csv. It would presumably also require an assumption (default value) for tz on import, unless the default format included the time zone offset on every value. Perhaps a POSIXct.format="format-string" parameter could be included to specify the format string to use when writing and reading POSIXct values, since colClasses = c(a="Date", b="POSIXct") doesn't allow a specific format and/or time zone to be specified.

@jangorecki
Member Author

@clarkdk You are discussing quite a different issue, as it requires parsing strings that store date/datetime. This issue is about date/datetime stored as numerics, and those can be effectively optimized for I/O speed if fwrite and fread were able to write and read those types as their numerics transparently (or using just colClasses). So no parsing is required. If you would like to speed up parsing strings, then fasttime can help.
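
For the write side, current data.table releases offer something close to this: as far as I know, fwrite's `dateTimeAs = "epoch"` writes Date columns as days and POSIXct columns as seconds since 1970-01-01. The sketch below pairs it with a manual restore on read that only reattaches attributes, so no string parsing takes place. The restore step is an illustration, not a built-in fread feature, and `tz = "UTC"` and the names `f` and `back` are assumptions.

```r
library(data.table)

dt = data.table(a = as.Date("2015-01-01"),
                b = as.POSIXct("2015-01-01 15:30:20", tz = "UTC"))

f = tempfile(fileext = ".csv")
# Write the underlying numbers instead of formatted date strings.
fwrite(dt, f, dateTimeAs = "epoch")

# fread sees ordinary numeric columns ...
back = fread(f)
# ... and the classes are put back by setting attributes only.
back[, a := structure(a, class = c("IDate", "Date"))]
back[, b := structure(as.numeric(b), class = c("POSIXct", "POSIXt"), tzone = "UTC")]
```

Restoring the class this way assumes the reader already knows which columns hold dates and which time zone they were written in.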

@clarkdk

clarkdk commented May 10, 2016

@jangorecki OK, now I understand. I will make a separate FR. In implementing datetimes as numerics in fread/fwrite, please don't rule out a possible future feature involving writing and parsing of datetimes as formatted strings.

mattdowle added a commit that referenced this issue Nov 11, 2016
…ite.csv'. Closes #1664. Closes #1772. Closes the fwrite part of #1656.
@mattdowle mattdowle added this to the v1.9.10 milestone Nov 11, 2016
@mattdowle mattdowle modified the milestones: Candidate, v1.10.6 Mar 17, 2018
@mattdowle mattdowle modified the milestones: v1.11.0, v1.11.2 Apr 29, 2018
@mattdowle mattdowle modified the milestones: 1.12.0, 1.12.2 Jan 11, 2019
@jangorecki jangorecki modified the milestones: 1.12.2, 1.12.4 Jan 24, 2019
@jangorecki
Member Author

Now that we have yaml support, this feature is even more useful, as we can maximise I/O speed without trade-offs.

@jangorecki jangorecki modified the milestones: 1.12.4, 1.13.0 Jul 25, 2019
@PavoDive

Not sure if this belongs in this issue, in 3391, or in the master task 2247, but I could reproduce unexpected behavior with fread and fwrite when yaml = TRUE:

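library(data.table)  # yaml = TRUE also needs the yaml package installed
dt = data.table(date = as.POSIXct(c("2006-05-01", "2006-05-02")),
                b = as.factor(c(1,2)),
                c = c(3,4))
print(class(dt$date))   # class before the round trip
fwrite(dt, "dt.csv", yaml = TRUE)

dt2 = fread("dt.csv", yaml = TRUE)
print(class(dt2$date))  # class after the round trip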

See: https://stackoverflow.com/questions/58493926/fread-with-yaml-true-read-type-date-as-character

@MichaelChirico
Member

@PavoDive could you file a new issue for that? please & thank you

@mattdowle mattdowle modified the milestones: 1.12.7, 1.12.9 Dec 8, 2019
5 participants