
How do we set limit? #890

Open · bray opened this issue Sep 14, 2023 · Discussed in #886 · 8 comments

bray commented Sep 14, 2023

When collection returns an ActiveRecord::Relation, how do we set the batch size for each "batch"? job-iteration's Best Practices doc says it defaults to 100, which is too small for expensive queries and/or large datasets.

It looks like that's passed in here.

But with maintenance_tasks, we don't explicitly build an enumerator in the collection method. So how can we pass in or set the @batch_size? I've tried some guesses like @batch_size, @of, def batch_size, and def of inside the MaintenanceTasks::Task subclass.

I also tried returning a few different classes from the collection method, but none have worked:

#find_each(batch_size: 2000) gets us what we want, but it returns an Enumerator, and #collection can't be an Enumerator (it raises "#collection must be either an Active Record Relation, ActiveRecord::Batches::BatchEnumerator, Array, or CSV.").

#in_batches(batch_size: 2000) works (it returns an ActiveRecord::Batches::BatchEnumerator), but only if you want #process to receive each batch, rather than individual records.

#in_batches(batch_size: 2000).each_record also returns an Enumerator.
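
For the record, a minimal sketch of those attempts (Post is a hypothetical model; only the uncommented variant is accepted):

```ruby
# Sketch of the attempts above; Post is a hypothetical model.
module Maintenance
  class BackfillPostsTask < MaintenanceTasks::Task
    def collection
      # Accepted, but iterates with job-iteration's default batch size of 100:
      Post.all

      # Rejected with "#collection must be either an Active Record Relation,
      # ActiveRecord::Batches::BatchEnumerator, Array, or CSV.":
      # Post.find_each(batch_size: 2000)

      # Accepted, but then #process receives whole batches, not records:
      # Post.in_batches(batch_size: 2000)
    end

    def process(post)
      # per-record work
    end
  end
end
```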

nvasilevski (Contributor)

I'd say this is purely a missing API, as there is no reason not to allow tasks to redefine the batch size used under the hood. The batch size will need to be passed to the active_record_on_records enumerator using the batch_size: option:

enumerator_builder.active_record_on_records(collection, cursor: cursor)

So I'll try to put it on our roadmap, or PRs are always welcome! Unless someone else among the maintainers remembers concerns about why this feature shouldn't exist.

The only requirement for this feature is that tasks shouldn't be forced to always specify the batch size; it should still default to 100 (or to nil, which means we'll use the default from job-iteration).
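
A rough sketch of what that could look like inside the gem's job (the task-level batch_size reader is an assumption, not a settled API):

```ruby
# Hypothetical sketch: forward an optional per-task batch size to
# job-iteration's enumerator builder. `task.batch_size` is assumed.
def build_enumerator(_run, cursor:)
  enumerator_builder.active_record_on_records(
    collection,
    cursor: cursor,
    batch_size: task.batch_size || 100, # or pass nil through and let job-iteration default it
  )
end
```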

nvasilevski (Contributor) commented Sep 14, 2023

> 100 is too small for expensive queries and/or large datasets.

I'm curious whether you have any examples where a query benefits from a larger batch size, or is it just about reducing the total number of queries the task performs? This is irrelevant to the need for such an API; it's just that we often hear the opposite, where people lower the batch size so the database searches for fewer records per query and returns results quicker.

I also just realized we need to be careful that people don't use overly large batch sizes. For example, if a task loads records in batches of 1000 but each process call takes 1s to finish, the task will never get through a whole batch before being interrupted, since the default interruption happens at 120s. The task would essentially throw away most of the loaded records, processing only ~120 of the 1000.

UPD:
Not sure why I assumed that job-iteration sets max_job_runtime to 2 minutes by default, but the point is still valid: batch_size should be aligned with max_job_runtime, to avoid loading records that won't be processed in a given run.
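
As a back-of-the-envelope illustration of that alignment (all numbers here are assumptions, not recommendations):

```ruby
# Assumed figures, purely illustrative:
max_job_runtime    = 120 # seconds; whatever JobIteration.max_job_runtime is configured to
seconds_per_record = 1.0 # measured cost of one #process call

# Any batch much larger than this will be mostly discarded on interruption,
# since the cursor only advances past records that were actually processed:
max_useful_batch_size = (max_job_runtime / seconds_per_record).floor # => 120
```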

bray (Author) commented Sep 14, 2023

Ok, thanks - good to know I wasn't missing something. I'll try to get a PR up!

An example: say you have to process 50 million records. Fetching 100 per query means 500,000 queries, as opposed to 50,000 or 10,000 queries at 1k or 5k per fetch. If the query takes only negligibly longer at the larger batch size, it's probably better to use one and greatly reduce the number of queries issued in a potentially small time window. This presumes each process call is cheap, e.g. it only updates the record, so it takes at most tens of ms.

That's a good point re: max_job_runtime. To be honest, I've set that to infinity to disable the periodic interruptions. I understand the tasks should be interruptible, but I don't really see the benefit of forcing interrupts every 5 minutes. The reasoning given here is to preserve worker capacity, but we'll be running these tasks in their own Sidekiq queue with dedicated workers, so that's not a concern for us.

So in our case, the concern about big batch sizes doesn't apply. But I understand we have to keep it in mind for changes to the gem.
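
For reference, a sketch of disabling the periodic interruption as described above (assuming job-iteration's global setting is what's being tuned):

```ruby
# config/initializers/job_iteration.rb
# Sketch: nil removes the runtime cap entirely, so jobs are only
# interrupted by worker shutdown, not on a timer.
JobIteration.max_job_runtime = nil

# Or keep a generous explicit cap instead:
# JobIteration.max_job_runtime = 30.minutes
```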

github-actions (bot)

This issue has been marked as stale because it has not been commented on in two months.
Please reply in order to keep the issue open. Otherwise, it will close in 14 days.
Thank you for contributing!

@github-actions github-actions bot added the stale label Jan 25, 2024
@github-actions github-actions bot closed this as not planned (stale) Feb 9, 2024
@nvasilevski nvasilevski reopened this Apr 25, 2024
@adrianna-chang-shopify adrianna-chang-shopify self-assigned this May 8, 2024
adrianna-chang-shopify (Contributor)

Maple and I hacked a bit on this during RailsConf Hack day and came up with a simple proof of concept: https://github.com/Shopify/maintenance_tasks/tree/custom-relation-limit-size

Thoughts on the API, @nvasilevski @bray? We opted for a task-level collection_limit DSL.
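
A sketch of how that DSL might read in a task, going by the name mentioned above (the exact mechanics on the branch may differ, and Post is a hypothetical model):

```ruby
# Sketch based on the proposed `collection_limit` DSL; the name was
# later changed (see the comments below).
module Maintenance
  class BackfillPostsTask < MaintenanceTasks::Task
    collection_limit 2000 # records fetched per query instead of the default 100

    def collection
      Post.all
    end

    def process(post)
      post.touch # hypothetical per-record work
    end
  end
end
```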

nvasilevski (Contributor)

I'm a little worried that any name with _limit_ in it may mislead users into thinking it's the maximum number of records the task will iterate over. Personally, I'd prefer a name with _batch_ in it.

But the prototype looks clean!

bray (Author) commented May 13, 2024

I agree with Nikita. Looks good, but s/limit/batch_size/g :)

adrianna-chang-shopify (Contributor)

Naming is hard 😅 I initially tried to steer clear of batch in the API because I was worried it would cause confusion with the batch relation enumerator. But that's the kwarg job-iteration uses, so it's probably clearer than limit 👍 Okay, I'll open a PR with some tests. Thanks for the feedback, y'all!
