You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Currently, Bytewax FixedPartitionedSource input operators concurrently read from all source partitions on every activation. This makes sense for the "streaming source" use case, as delaying reading from a partition is undesirable because usually all partitions are approximately dealing with the same watermark of data.
For batch processing, though, this can result in some undesirable situations. One example of this is using DirSource with a directory of file_count >> worker_count. This means that a bazillion files might be opened simultaneously and the full batch_size read from each one. This makes tuning the batch_size parameter more confusing and could lead to resource exhaustion. This would also add a safety on the parameter added in #380 , if you pick a "too small" line count and a large number of partitions are activated, you don't trigger the undesirable effects above.
Possible Implementation
I'd suggest adding a FixedSourcePartition.concurrency_mode() abstract method which returns either "STREAM" or "BATCH". If it returns "STREAM" the current behavior is preserved. If it returns "BATCH" a new type of input operator logic is activated:
Only one partition that is primary on that worker is opened at a time.
That partition is read to EOF. It must EOF. It is a contract violation for the partition to not EOF.
Once the partition is complete, pick another partition for which the worker is primary and open it and repeat.
Caveats
It does introduce another way to incorrectly write a source: set "BATCH" and never EOF.
I am having trouble thinking of an input source that might have some streaming partitions and some batch partitions. But if one exists, this might not be the right design.
This does not need to be introduced on DynamicSource because there are not multiple partitions within.
The text was updated successfully, but these errors were encountered:
An optional follow on to #380
Currently, Bytewax
FixedPartitionedSource
input operators concurrently read from all source partitions on every activation. This makes sense for the "streaming source" use case, as delaying reading from a partition is undesirable because usually all partitions are approximately dealing with the same watermark of data.For batch processing, though, this can result in some undesirable situations. One example of this is using
DirSource
with a directory offile_count >> worker_count
. This means that a bazillion files might be opened simultaneously and the fullbatch_size
read from each one. This makes tuning thebatch_size
parameter more confusing and could lead to resource exhaustion. This would also add a safety on the parameter added in #380 , if you pick a "too small" line count and a large number of partitions are activated, you don't trigger the undesirable effects above.Possible Implementation
I'd suggest adding a
FixedSourcePartition.concurrency_mode()
abstract method which returns either"STREAM"
or"BATCH"
. If it returns"STREAM"
the current behavior is preserved. If it returns"BATCH"
a new type of input operator logic is activated:Only one partition that is primary on that worker is opened at a time.
That partition is read to EOF. It must EOF. It is a contract violation for the partition to not EOF.
Once the partition is complete, pick another partition for which the worker is primary and open it and repeat.
Caveats
It does introduce another way to incorrectly write a source: set
"BATCH"
and never EOF.I am having trouble thinking of an input source that might have some streaming partitions and some batch partitions. But if one exists, this might not be the right design.
This does not need to be introduced on
DynamicSource
because there are not multiple partitions within.The text was updated successfully, but these errors were encountered: