Feature request: check the size of /dev/shm to see whether it matches the mmap size. In my case I was running R inside a Docker container, where the default size of /dev/shm is 64M. When the code mmaps a piped stream (such as with hadoop fs -cat /myfile.csv), it only reads shm-size bytes from the pipe into the mmap. No error is reported via the C API, which I suspect is normal. However, debugging why fread complained about the file format turned into a deep dive through the R and C code of data.table to discover that it uses this mechanism. The error reported (an essentially random message, depending on where my pipe happened to be cut off):
(ERROR):Expected sep (',') but ' ' ends field 0 when detecting types from point 10: 14183667
This can be reproduced by doing the following:
Build a file that is ~5 MB larger than /dev/shm
Set /dev/shm to something small like 64M (the default for a Docker container)
Run fread on "cat ~/myfile.csv" (cat creates the pipe)
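The steps above can be sketched as a quick shell check that predicts whether the pipe will be truncated. This is illustrative only: the file name and sizes are placeholders, not values from the original report, and df's portable output format is assumed.

```shell
# Sketch of the reproduction: compare the file size against the free space
# on /dev/shm before piping it into fread. File name and sizes are
# illustrative, not from the original report.
FILE=./shm_demo.csv
head -c $((5 * 1024 * 1024)) /dev/zero > "$FILE"    # 5 MB dummy file
file_bytes=$(wc -c < "$FILE" | tr -d ' ')

# df -kP prints free space in 1K blocks; /dev/shm only exists on Linux.
shm_free_kb=$(df -kP /dev/shm 2>/dev/null | awk 'NR==2 {print $4}')
if [ -n "$shm_free_kb" ] && [ "$file_bytes" -gt $((shm_free_kb * 1024)) ]; then
  echo "file larger than free /dev/shm: expect truncated pipe input"
else
  echo "file fits (or /dev/shm not present)"
fi
rm -f "$FILE"
```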
Docker V1.12+
Centos latest image from docker hub
R-open v3.4.0 (microsoft)
In the code: https://github.com/Rdatatable/data.table/blob/master/src/fread.c
Around line 788 it should perhaps check the size of /dev/shm to see whether it matches the file it just read into memory. In my case, inside Docker, here is the verbose output of the failing test:
> dat.df3<-fread("/opt/cloudera/parcels/CDH/bin/hadoop fs -cat /user/tcederquist/tim_pop_comm_14_5 | head -3668850" ,sep=",", header=TRUE, verbose=TRUE)
Input contains no \n. Taking this to be a filename to open
File opened, filesize is 0.062500 GB.
Memory mapping ... ok
....basic output
Type codes (point 8): 1111
Type codes (point 9): 1111
(ERROR):Expected sep (',') but ' ' ends field 0 when detecting types from point 10: 14183667
Expected results:
File opened, filesize is 0.373524 GB.
In addition, when it fails, fread knows internally the line number and other useful information about where it failed. I had to zero in on the value by hand before I discovered the verbose flag; it would be nice if normal error messages indicated the row number. verbose=TRUE shows the location when it calculates the number of delimiters, which for this test case would have been useful output on the error itself (since I knew the file had 20M records):
Count of eol: 3668839 (including 0 at the end)
nrow = MIN( nsep [11006514] / (ncol [4] -1), neol [3668839] - endblanks [0] ) = 3668838
For anyone hitting this same issue, the short-term fix is to increase the shared memory size of the container, or of your OS if its /dev/shm is too small. Typical modern OSes use 50% of available memory. On my 64 GB Amazon EC2 instance, I set the Docker container to use:
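The exact setting from the comment above isn't preserved here. As an illustration, the relevant knob is Docker's --shm-size flag; the value and image name below are placeholders, not the original configuration.

```shell
# Give the container a larger /dev/shm; pick roughly half of host RAM.
# "my-r-image" and 32g are placeholder values.
docker run --shm-size=32g my-r-image

# From inside the container, confirm the effective size:
df -h /dev/shm
```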
Agreed. Sorry about that.
A recent change in dev is this, from NEWS:
Ram disk (/dev/shm) is no longer used for the output of system command input. Although faster when it worked, it was causing too many device full errors; e.g., #1139 and zUMIs/19. Thanks to Kyle Chung for reporting. Standard tempdir() is now used. If you wish to use ram disk, set TEMPDIR to /dev/shm; see ?tempdir.
Please try dev 1.10.5 and open a new issue if it's still a problem.
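Per that NEWS entry, a user who still wants the ram disk can point TEMPDIR at /dev/shm before starting R. A minimal sketch, assuming a Linux host with /dev/shm mounted; the CSV path is a placeholder:

```shell
# Opt back into the ram disk for fread's piped-input buffering (>= 1.10.5).
# TEMPDIR must be set before R starts; ~/myfile.csv is a placeholder path.
TEMPDIR=/dev/shm R -e 'data.table::fread("cat ~/myfile.csv")'
```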