
How do we set limit? #890

Open · bray opened this issue Sep 14, 2023 · Discussed in #886 · 8 comments

bray commented Sep 14, 2023

When collection returns an ActiveRecord::Relation, how do we set the batch size for each "batch"? job-iteration's Best Practices doc says it defaults to 100, which is too small for expensive queries and/or large datasets.

It looks like that's passed in here.

But with maintenance_tasks, we don't explicitly build an enumerator in the collection method. So how can we pass in or set the @batch_size? I've tried some guesses like @batch_size, @of, def batch_size, and def of inside the MaintenanceTasks::Task subclass.

I also tried returning a few different classes from the collection method, but none have worked:

#find_each(batch_size: 2000) gets us what we want, but it returns an Enumerator, and #collection can't be an Enumerator (it raises "#collection must be either an Active Record Relation, ActiveRecord::Batches::BatchEnumerator, Array, or CSV.").

#in_batches(batch_size: 2000) works (it returns an ActiveRecord::Batches::BatchEnumerator), but only if you want #process to receive each batch, rather than individual records.

#in_batches(batch_size: 2000).each_record also returns an Enumerator.
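
For the record, a minimal sketch of those attempts (Post is a hypothetical model; only the uncommented variant is accepted):

```ruby
# Sketch of the attempts above; Post is a hypothetical model.
module Maintenance
  class BackfillPostsTask < MaintenanceTasks::Task
    def collection
      # Accepted, but iterates with job-iteration's default batch size of 100:
      Post.all

      # Rejected with "#collection must be either an Active Record Relation,
      # ActiveRecord::Batches::BatchEnumerator, Array, or CSV.":
      # Post.find_each(batch_size: 2000)

      # Accepted, but then #process receives whole batches, not records:
      # Post.in_batches(batch_size: 2000)
    end

    def process(post)
      # per-record work
    end
  end
end
```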

nvasilevski (Contributor)

I'd say this is purely a missing API, as there is no reason not to allow tasks to redefine the batch size used under the hood. The batch size will need to be passed to the active_record_on_records enumerator using the batch_size: option:

enumerator_builder.active_record_on_records(collection, cursor: cursor)

So I'll try to put it on our roadmap, or PRs are always welcome! Unless someone else among the maintainers remembers concerns about why this feature shouldn't exist.

The only requirement for this feature is that tasks shouldn't be forced to always specify the batch size; it should still default to 100 (or to nil, which means we'll use the default from job-iteration).
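
A rough sketch of what that could look like inside the gem's job (the task-level batch_size reader is an assumption, not a settled API):

```ruby
# Hypothetical sketch: forward an optional per-task batch size to
# job-iteration's enumerator builder. `task.batch_size` is assumed.
def build_enumerator(_run, cursor:)
  enumerator_builder.active_record_on_records(
    collection,
    cursor: cursor,
    batch_size: task.batch_size || 100, # or pass nil through and let job-iteration default it
  )
end
```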

nvasilevski (Contributor) commented Sep 14, 2023

> 100 is too small for expensive queries and/or large datasets.

I'm curious whether you have any examples where a query benefits from a larger batch size, or is it just about reducing the total number of queries the task performs? This is irrelevant to the need for such an API; it's just that we often hear the opposite, where people lower the batch size so the database searches for fewer records per query and returns results quicker.

I also just realized we need to be careful that people don't use overly large batch sizes. For example, if a task loads records in batches of 1000 but each process call takes 1s to finish, the task will never get through a whole batch before being interrupted, since the default interruption happens at 120s. The task would essentially throw away most of the loaded records, processing only ~120 of the 1000.

UPD:
Not sure why I assumed that job-iteration sets max_job_runtime to 2 minutes by default, but the point is still valid: batch_size should be aligned with max_job_runtime, to avoid loading records that won't be processed in a given run.
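
As a back-of-the-envelope illustration of that alignment (all numbers here are assumptions, not recommendations):

```ruby
# Assumed figures, purely illustrative:
max_job_runtime    = 120 # seconds; whatever JobIteration.max_job_runtime is configured to
seconds_per_record = 1.0 # measured cost of one #process call

# Any batch much larger than this will be mostly discarded on interruption,
# since the cursor only advances past records that were actually processed:
max_useful_batch_size = (max_job_runtime / seconds_per_record).floor # => 120
```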

bray (Author) commented Sep 14, 2023

Ok, thanks - good to know I wasn't missing something. I'll try to get a PR up!

An example: say you have to process 50 million records. Fetching 100 per query means 500,000 queries, as opposed to 50,000 or 10,000 queries at 1k or 5k per fetch. If the query takes only negligibly longer at the larger batch size, it's probably better to use one and greatly reduce the number of queries issued in a potentially small time window. This presumes each process call is cheap, e.g. it only updates the record, so it takes at most tens of ms.

That's a good point re: max_job_runtime. To be honest, I've set that to infinity to disable the periodic interruptions. I understand the tasks should be interruptible, but I don't really see the benefit of forcing interrupts every 5 minutes. The reasoning given here is to preserve worker capacity, but we'll be running these tasks in their own Sidekiq queue with dedicated workers, so that's not a concern for us.

So in our case, the concern about big batch sizes doesn't apply. But I understand we have to keep it in mind for changes to the gem.
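
For reference, a sketch of disabling the periodic interruption as described above (assuming job-iteration's global setting is what's being tuned):

```ruby
# config/initializers/job_iteration.rb
# Sketch: nil removes the runtime cap entirely, so jobs are only
# interrupted by worker shutdown, not on a timer.
JobIteration.max_job_runtime = nil

# Or keep a generous explicit cap instead:
# JobIteration.max_job_runtime = 30.minutes
```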

github-actions (bot)

This issue has been marked as stale because it has not been commented on in two months.
Please reply in order to keep the issue open. Otherwise, it will close in 14 days.
Thank you for contributing!

@github-actions github-actions bot added the stale label Jan 25, 2024
@github-actions github-actions bot closed this as not planned (stale) Feb 9, 2024
@nvasilevski nvasilevski reopened this Apr 25, 2024
@adrianna-chang-shopify adrianna-chang-shopify self-assigned this May 8, 2024
adrianna-chang-shopify (Contributor)

Maple and I hacked a bit on this during RailsConf Hack day and came up with a simple proof of concept: https://github.com/Shopify/maintenance_tasks/tree/custom-relation-limit-size

Thoughts on the API, @nvasilevski @bray? We opted for a task-level collection_limit DSL.
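
A sketch of how that DSL might read in a task, going by the name mentioned above (the exact mechanics on the branch may differ, and Post is a hypothetical model):

```ruby
# Sketch based on the proposed `collection_limit` DSL; the name was
# later changed (see the comments below).
module Maintenance
  class BackfillPostsTask < MaintenanceTasks::Task
    collection_limit 2000 # records fetched per query instead of the default 100

    def collection
      Post.all
    end

    def process(post)
      post.touch # hypothetical per-record work
    end
  end
end
```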

nvasilevski (Contributor)

I'm a little worried that any name with _limit_ in it may mislead users into thinking it's the maximum number of records the task will iterate over. Personally, I'd prefer a name with _batch_ in it.

But the prototype looks clean!

bray (Author) commented May 13, 2024

I agree with Nikita. Looks good, but s/limit/batch_size/g :)

adrianna-chang-shopify (Contributor)

Naming is hard 😅 I initially tried to steer clear of batch in the API because I was worried it would cause confusion with the batch relation enumerator. But that's the kwarg job-iteration uses, so it's probably clearer than limit 👍 Okay, I'll open a PR with some tests. Thanks for the feedback, y'all!
