Hello! The `convert` function is a very simple wrapper around the read and write operations of the individual file types. For file types with either chunked APIs or `skip` and `max_rows` parameters, it would be possible to read parts of the file in parallel and then either write separate output files or hold the chunks in memory, combine them, and write once (writing in parallel is likely trickier). In many cases, such as `csv` to `dta`, this would give a noticeable speedup at the cost of more CPU and memory usage; in others it either doesn't make sense at all or doesn't improve performance (for data that can't be chunked). Nevertheless, since the majority of the data used by R users follows the basic row-column specification, this would work for a lot of useful data types. I think this could be implemented with something as simple as an `n_workers` argument and `future.apply` in the background. I would love to hear thoughts on the suggestion!
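To make the idea concrete, here is a rough sketch of the "read chunks in parallel, combine in memory, write once" variant for CSV input. The name `convert_chunked` and the `chunk_rows` argument are hypothetical and not part of the package; only `n_workers` comes from the suggestion above. It uses base `read.csv()` with `skip`/`nrows` for the chunked reads and assumes the file has a header row, at least one data row, and no quoted fields containing line breaks.

```r
library(future.apply)  # provides future_lapply()

# Hypothetical helper (not an existing function): read a CSV in row chunks
# in parallel, combine the chunks in memory, then write the result once.
convert_chunked <- function(in_file, out_file, n_workers = 2L, chunk_rows = 1e5) {
  # Use background R sessions as workers; restore the previous plan on exit.
  old_plan <- future::plan(future::multisession, workers = n_workers)
  on.exit(future::plan(old_plan), add = TRUE)

  # Column names from the header row.
  header <- names(read.csv(in_file, nrows = 1L))

  # Number of data rows. Assumes one record per line, which breaks if
  # quoted fields contain embedded newlines.
  n_rows <- length(readLines(in_file)) - 1L
  stopifnot(n_rows > 0L)

  # Zero-based offset of the first data row of each chunk.
  starts <- seq(0, n_rows - 1L, by = chunk_rows)

  # Each worker skips the header plus the preceding rows and reads one chunk.
  chunks <- future_lapply(starts, function(s) {
    read.csv(in_file, skip = s + 1L, nrows = chunk_rows,
             header = FALSE, col.names = header)
  })

  # Combine in memory and write once; the output format is taken from the
  # extension of out_file.
  out <- do.call(rbind, chunks)
  rio::export(out, out_file)
  invisible(out_file)
}
```

Something like `convert_chunked("data.csv", "data.dta", n_workers = 4L)` would then cover the `csv` to `dta` case. One caveat of this sketch is that column types are guessed per chunk, so a real implementation would probably need to fix the column classes up front before combining.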