Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Some mechanism for queuing batch source partition reads #381

Open
davidselassie opened this issue Jan 22, 2024 · 0 comments
Open

Some mechanism for queuing batch source partition reads #381

davidselassie opened this issue Jan 22, 2024 · 0 comments
Labels
needs triage New issue, needs triage type:feature New feature request

Comments

@davidselassie
Copy link
Contributor

davidselassie commented Jan 22, 2024

An optional follow on to #380

Currently, Bytewax FixedPartitionedSource input operators concurrently read from all source partitions on every activation. This makes sense for the "streaming source" use case, as delaying reading from a partition is undesirable because usually all partitions are approximately dealing with the same watermark of data.

For batch processing, though, this can result in some undesirable situations. One example of this is using DirSource with a directory of file_count >> worker_count. This means that a bazillion files might be opened simultaneously and the full batch_size read from each one. This makes tuning the batch_size parameter more confusing and could lead to resource exhaustion. This would also add a safety on the parameter added in #380 , if you pick a "too small" line count and a large number of partitions are activated, you don't trigger the undesirable effects above.

Possible Implementation

I'd suggest adding a FixedSourcePartition.concurrency_mode() abstract method which returns either "STREAM" or "BATCH". If it returns "STREAM" the current behavior is preserved. If it returns "BATCH" a new type of input operator logic is activated:

  • Only one partition that is primary on that worker is opened at a time.

  • That partition is read to EOF. It must EOF. It is a contract violation for the partition to not EOF.

  • Once the partition is complete, pick another partition for which the worker is primary and open it and repeat.

Caveats

It does introduce another way to incorrectly write a source: set "BATCH" and never EOF.

I am having trouble thinking of an input source that might have some streaming partitions and some batch partitions. But if one exists, this might not be the right design.

This does not need to be introduced on DynamicSource because there are not multiple partitions within.

@davidselassie davidselassie added the type:feature New feature request label Jan 22, 2024
@github-actions github-actions bot added the needs triage New issue, needs triage label Jan 22, 2024
@davidselassie davidselassie changed the title Some mechanism for queuing source partition reads Some mechanism for queuing batch source partition reads Jan 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs triage New issue, needs triage type:feature New feature request
Projects
None yet
Development

No branches or pull requests

1 participant