Some mechanism for queuing batch source partition reads #381

davidselassie · 2024-01-22T19:35:26Z

An optional follow on to #380

Currently, Bytewax FixedPartitionedSource input operators concurrently read from all source partitions on every activation. This makes sense for the "streaming source" use case, as delaying reading from a partition is undesirable because usually all partitions are approximately dealing with the same watermark of data.

For batch processing, though, this can result in some undesirable situations. One example of this is using DirSource with a directory of file_count >> worker_count. This means that a bazillion files might be opened simultaneously and the full batch_size read from each one. This makes tuning the batch_size parameter more confusing and could lead to resource exhaustion. This would also add a safety on the parameter added in #380 , if you pick a "too small" line count and a large number of partitions are activated, you don't trigger the undesirable effects above.

Possible Implementation

I'd suggest adding a FixedSourcePartition.concurrency_mode() abstract method which returns either "STREAM" or "BATCH". If it returns "STREAM" the current behavior is preserved. If it returns "BATCH" a new type of input operator logic is activated:

Only one partition that is primary on that worker is opened at a time.
That partition is read to EOF. It must EOF. It is a contract violation for the partition to not EOF.
Once the partition is complete, pick another partition for which the worker is primary and open it and repeat.

Caveats

It does introduce another way to incorrectly write a source: set "BATCH" and never EOF.

I am having trouble thinking of an input source that might have some streaming partitions and some batch partitions. But if one exists, this might not be the right design.

This does not need to be introduced on DynamicSource because there are not multiple partitions within.

The text was updated successfully, but these errors were encountered:

davidselassie added the type:feature New feature request label Jan 22, 2024

github-actions bot added the needs triage New issue, needs triage label Jan 22, 2024

davidselassie changed the title ~~Some mechanism for queuing source partition reads~~ Some mechanism for queuing batch source partition reads Jan 22, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Some mechanism for queuing batch source partition reads #381

Some mechanism for queuing batch source partition reads #381

davidselassie commented Jan 22, 2024 •

edited

Some mechanism for queuing batch source partition reads #381

Some mechanism for queuing batch source partition reads #381

Comments

davidselassie commented Jan 22, 2024 • edited

Possible Implementation

Caveats

davidselassie commented Jan 22, 2024 •

edited