Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chunked filter #344

Open
rafalkrupinski opened this issue Jul 4, 2023 · 3 comments
Open

chunked filter #344

rafalkrupinski opened this issue Jul 4, 2023 · 3 comments

Comments

@rafalkrupinski
Copy link

Is boltons a good home for a filter_chunked function?

def filter_chunked(iterable: Iterable[T], chunk_size: int, func: Callable[[Iterable[T]], Iterable[bool]]) -> Iterator[T]:
    for chunk in boltons.iterutils.chunked(iterable, chunk_size):
        for item, allow in zip(chunk, func(chunk)):
            if allow:
                yield item

It behaves similar to filter(), except the predicate is called with chunks of input. It's useful the predicate must make calls to external services, but can do it in batches.

@mahmoud
Copy link
Owner

mahmoud commented Jul 6, 2023

Hey Raphael! So, filter_chunked is certainly adjacent to some things I've done in the past. Just at first blush, I wondered how this might be rewritten as a composition of chunked_iter, zip, and filter. And it turned out to be pretty annoying!

(something like:

from boltons.iterutils import chunked_iter

key = lambda chunk: [bool(x%2) for x in chunk]
vals = list(range(20))
chunk_size = 5

filtered_chunked = (x for chunk in chunked_iter(vals, chunk_size) for x, allow in zip(chunk, key(chunk)) if allow)

# list(filtered_chunked) -> [1, 3, 5, 7, 9, 11, 13, 15, 17, 19]

If it's this annoying, I'm generally for including it. What do you think of the simplified expression above? You'll definitely want to use chunked_iter if you're going to have make a generator. We may need to name it filter_chunked_iter, too.

The input iterable is flat, and the output iterable is flat, so that dimensionality matches, and there's no need for a value_transform convenience like with bucketize.

We'll also need a nice, demonstrative docstring. What's your motivating example in this case? Are there a lot of APIs where this batch nature applies?

Thanks again!

@rafalkrupinski
Copy link
Author

What do you think of the simplified expression above?

I'm happy with a generator expression, I'd just split it in lines and use a nicer variable name for my PR.
I wouldn't use key as the function-param name, bc it suggests it returns some inherent property of the object (like in sorted). I did use just func above but I think something batch_pred or chunk_pred (predicate) would suit.

What's your motivating example in this case?

My code is looking for resources that exist in one API but not in another one. Both APIs use paging, have limits on query length, and allow for property in $values semantics.

The code goes over an iterator (generator) of results of the first API (paged, but that's hidden in the generator), takes the objects' ids and searches them in the second API. Whatever is returned can be discarded from the input as processed in previous runs. Searching step is wrapped in a batch predicate and used in chunked_filter.

One alternative would collecting all the objects in multiple fetches, but that would only increase memory footprint, and still the list would need to be chunked for the second API.

Another alternative would be fetching each resource in a separate query, but that would add time overhead from IO and API rate limits.

Are there a lot of APIs where this batch nature applies?

Not sure how to answer your question.
One notable example would be SQL. My project is using AirTable, which is a database wannabe with REST API and a query (formula) language.

@rafalkrupinski
Copy link
Author

We'll also need a nice, demonstrative docstring.

Something like

https://thedailywtf.com/articles/Very,_Very_Well_Documented

rafalkrupinski pushed a commit to rafalkrupinski/boltons that referenced this issue Jul 9, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants