`{CSV,Dir,File}Source` all currently use each entire file as the unit of parallelism and emit a single partition in `list_parts` for each file name. (With some trickery surrounding `get_fs_id` so that you can mark files with the same name as being unique if they're on different mounts.)
It would be a nice addition to support stateful parallel reading of a single file that is accessible from multiple workers. The core idea is to define a partition not as a whole file, but as a specific chunk of lines in a file, and to allow multiple workers to declare they can read it.
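One way to picture "a partition is a chunk, not a file" is the partition key itself. As a sketch (these helper names and the key format are hypothetical, not part of the existing source API), the key could carry the filesystem id, path, and byte range, so any worker that can see the file can claim the chunk:

```python
# Hypothetical partition-key format: one key per byte-range chunk of a
# file. The fs_id component namespaces files that share a name but
# live on different mounts (mirroring the get_fs_id trickery above).
def encode_part_key(fs_id: str, path: str, start: int, end: int) -> str:
    """Encode a file chunk as a partition key string."""
    return f"{fs_id}::{path}::{start}::{end}"

def decode_part_key(key: str) -> tuple:
    """Recover (fs_id, path, start, end) from a partition key."""
    fs_id, path, start, end = key.rsplit("::", 3)
    return fs_id, path, int(start), int(end)
```

Any worker that emits the same key for the same chunk lets the runtime treat those chunks as interchangeable partitions.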
Implementation
We could add an argument to these sources, `part_line_size`. `list_parts` would look at the files that worker has available and their sizes, and declare a partition for each chunk of up to `part_line_size` lines. (Or use byte offsets instead of lines. There are some tricky implementation details here in finding consistent line boundaries across chunks to prevent accidentally double-reading a line.) Then in `build_part`, parse the partition to see which offsets in the file you should read.
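The line-boundary detail can be sketched in the byte-offset variant. A minimal sketch (function names are hypothetical; a real source would hold an open handle and stream rather than return a list): a line belongs to the chunk in which it begins, and each reader seeks to `start - 1` and discards one line to align itself, so no line is dropped or read twice:

```python
import os

def list_chunks(path: str, part_byte_size: int):
    """One candidate partition per fixed-size byte range of the file."""
    size = os.path.getsize(path)
    return [(lo, min(lo + part_byte_size, size))
            for lo in range(0, size, part_byte_size)]

def read_chunk(path: str, start: int, end: int):
    """Read exactly the lines that *begin* in [start, end).

    To align on a line boundary, seek to start - 1 and discard one
    line: if byte start - 1 is a newline, this consumes only it, so a
    line beginning exactly at start is kept; otherwise it consumes the
    tail of a line owned by the previous chunk. A line that begins
    before end is read in full even if it crosses end."""
    lines = []
    with open(path, "rb") as f:
        if start != 0:
            f.seek(start - 1)
            f.readline()  # align to the next line boundary
        while f.tell() < end:
            line = f.readline()
            if not line:
                break
            lines.append(line)
    return lines
```

With this convention the union of all chunks' output is exactly the file, regardless of where the raw byte boundaries fall relative to newlines.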
Caveats
As with basically all performance-based features, turning this knob up does not always result in more performance. It is best used in the case where `worker_count >> file_count` and every worker can read all files; then you can get more parallelism.
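That heuristic could even be automated when picking a chunk size. A rough sketch (this helper is hypothetical, not a proposed API surface), assuming the goal is about one chunk per worker and no chunking at all when whole files already saturate the cluster:

```python
import math
from typing import Optional

def suggest_part_byte_size(total_bytes: int, file_count: int,
                           worker_count: int) -> Optional[int]:
    """Heuristic: only chunk files when workers outnumber files;
    otherwise whole-file partitions already give every worker work.
    Returns a chunk size yielding roughly one chunk per worker."""
    if worker_count <= file_count:
        return None  # whole-file parallelism is enough
    return max(1, math.ceil(total_bytes / worker_count))

# e.g. 4 files totalling 1 GiB across 16 workers -> ~64 MiB chunks
```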
Inspired by #379