Deprecate BatchOver in favor of in_batches(use_ranges: true) #136

Draft

maximerety wants to merge 1 commit into master
Conversation

@maximerety (Contributor) commented Feb 22, 2024

Starting from ActiveRecord 7.1, there's a built-in helper equivalent to what BatchOver does. Let's use it instead of maintaining our own implementation forever.

We need to keep BatchOver for as long as compatibility with ActiveRecord < 7.1 is maintained, though.


If using ActiveRecord 7.1 or later, we would use the recommended built-in method in_batches with the use_ranges: true option, e.g.

User.in_batches(of: 100, use_ranges: true).each { |batch| ... }

Otherwise, we would still use BatchOver as a fallback:

SafePgMigrations::Helpers::BatchOver.new(User, of: 100).each_batch { |batch| ... }
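
A minimal sketch of what the version-gated dispatch could look like (illustrative only; model and the batch size of 100 are placeholders, not the exact patch):

if ActiveRecord.version >= Gem::Version.new('7.1')
  model.in_batches(of: 100, use_ranges: true).each do |batch|
    # batch is an ActiveRecord::Relation scoped to one id range
  end
else
  SafePgMigrations::Helpers::BatchOver.new(model, of: 100).each_batch do |batch|
    # batch is a scope covering one id range
  end
end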

Note that although both helpers are almost equivalent, there are small differences in the queries generated.

With the example code above, and assuming the users table contains 250 records, BatchOver would generate:

/* Get batch #1 */
SELECT "users".* FROM "users" ORDER BY "users"."id" ASC LIMIT 1
SELECT "users".* FROM "users" ORDER BY "users"."id" ASC LIMIT 1 OFFSET 100
SELECT "users".* FROM "users" WHERE "users"."id" >= 1 AND "users"."id" < 101 ORDER BY "users"."id" ASC
/* Do something with result */

/* Get batch #2 */
SELECT "users".* FROM "users" WHERE "users"."id" >= 101 ORDER BY "users"."id" ASC LIMIT 1
SELECT "users".* FROM "users" WHERE "users"."id" >= 101 ORDER BY "users"."id" ASC LIMIT 1 OFFSET 100
SELECT "users".* FROM "users" WHERE "users"."id" >= 101 AND "users"."id" < 201 ORDER BY "users"."id" ASC
/* Do something with result */

/* Get batch #3 */
SELECT "users".* FROM "users" WHERE "users"."id" >= 201 ORDER BY "users"."id" ASC LIMIT 1
SELECT "users".* FROM "users" WHERE "users"."id" >= 201 ORDER BY "users"."id" ASC LIMIT 1 OFFSET 100
SELECT "users".* FROM "users" WHERE "users"."id" >= 201 ORDER BY "users"."id" ASC
/* Do something with result */

/* No more batches */

Whereas in_batches(of: 100, use_ranges: true) would give:

/* Get batch #1 */
SELECT "users"."id" FROM "users" ORDER BY "users"."id" ASC LIMIT 100
SELECT "users".* FROM "users" WHERE "users"."id" <= 100 
/* Do something with result */

/* Get batch #2 */
SELECT "users"."id" FROM "users" WHERE "users"."id" > 100 ORDER BY "users"."id" ASC LIMIT 100 
SELECT "users".* FROM "users" WHERE "users"."id" > 100 AND "users"."id" <= 200
/* Do something with result */

/* Get batch #3 */
SELECT "users"."id" FROM "users" WHERE "users"."id" > 200 ORDER BY "users"."id" ASC LIMIT 100
SELECT "users".* FROM "users" WHERE "users"."id" > 200 AND "users"."id" <= 250
/* Do something with result */

/* No more batches */

Both helpers would work exactly the same if passing any ActiveRecord::Relation object in place of the model User, e.g. User.where(condition: 'something'), with the additional condition appearing in the WHERE clause of each query.
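
For instance, with User.where(condition: 'something') and use_ranges: true, the first boundary query would presumably become something like:

SELECT "users"."id" FROM "users" WHERE "users"."condition" = 'something' ORDER BY "users"."id" ASC LIMIT 100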

To select the id range of each batch (min/max id), BatchOver generates two queries per batch; these return full records (not only ids) but retrieve only 2 records. Conversely, in_batches(use_ranges: true) generates a single query per batch that returns only ids (not full records), but it returns the full list of ids in the range instead of just the min/max.

I believe that trade-off is acceptable for our purpose, but that is debatable.

maximerety marked this pull request as ready for review on February 23, 2024 10:05
maximerety requested a review from a team as a code owner on February 23, 2024 10:05
@maximerety (Contributor, Author) commented:

Benchmark

Scenario

The benchmark iterates over a table having 30M records, in batches of 10k records.

The table has an auto-incrementing id starting from 1, and 65 columns (so returning full records is at least a little costly).

Measurements

The benchmark is executed on a single machine, with a round-trip time < 1 ms.

Network (I/O) stats obtained with:

docker stats --no-stream postgres --format 'table {{.NetIO}}'

Postgres I/O stats obtained with:

SELECT
  heap_blks_read, heap_blks_hit, idx_blks_read, idx_blks_hit
FROM
  pg_statio_user_tables
WHERE
  relname = '<the-table>';

The metrics below account only for the time spent generating the scopes, not executing them. For example, had we actually executed the scopes generated with use_ranges: false, the duration and network usage would have been even worse because of the long lists of ids included in the generated queries.
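
As an illustration, the measurement loop could look roughly like this (a hypothetical sketch: model and the batch size follow the scenario above; calling to_sql generates each batch scope without executing it):

require 'benchmark'

duration = Benchmark.realtime do
  model.in_batches(of: 10_000, use_ranges: true) do |batch|
    batch.to_sql # generate the batch scope only, never run it
  end
end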

Results

| Batching method | Duration | Network (I/O) | heap_blks_read / hit | idx_blks_read / hit |
| --- | --- | --- | --- | --- |
| BatchOver | 22.4 s | 16 MB / 16 MB | 372k / 6k | 96k / 1743k |
| BatchOver + optim | 11.9 s | 3 MB / 1 MB | 0k / 6k | 96k / 1743k |
| in_batches(use_ranges: false) | 21.7 s | 4 MB / 560 MB | 0k / 3k | 96k / 872k |
| in_batches(use_ranges: true) | 17.5 s | 4 MB / 560 MB | 0k / 3k | 96k / 872k |
| in_batches(use_ranges: true) + optim | 6.6 s | 1 MB / <1 MB | 0k / 3k | 96k / 872k |

(*) + optim: see below

Additional optimizations

In BatchOver + optim, we reduce the number of database block reads and the network usage by requesting only record ids instead of full records. This optimization is proposed in #138. In the case of the present benchmark, we are able to use an Index Only Scan on the primary key instead of an Index Scan, which is the case in which the optimization produces the greatest gains.
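
For illustration, with this optimization the two boundary queries of batch #1 in the 250-record example above would presumably select only the id column, e.g.:

SELECT "users"."id" FROM "users" ORDER BY "users"."id" ASC LIMIT 1
SELECT "users"."id" FROM "users" ORDER BY "users"."id" ASC LIMIT 1 OFFSET 100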

In in_batches(use_ranges: true) + optim, the optimization consists in querying only the last id of the range (a LIMIT + OFFSET strategy actually taken from BatchOver) instead of returning the list of all ids in the range. So we get the best of both worlds: a single query and a single id returned. I'm preparing a fix to upstream to https://github.com/rails/rails.
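
Applied to batch #2 of the 250-record example above, the optimized queries would presumably look like this (a sketch of the idea, not the exact upstream patch):

/* Get only the last id of the range with LIMIT + OFFSET */
SELECT "users"."id" FROM "users" WHERE "users"."id" > 100 ORDER BY "users"."id" ASC LIMIT 1 OFFSET 99
/* Fetch the batch using the range bounds only */
SELECT "users".* FROM "users" WHERE "users"."id" > 100 AND "users"."id" <= 200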

@frederic-martin-doctolib (Member) commented:

Did you restart pg between each run? It's weird to see that we read data from disk only in the first run ("BatchOver").

@frederic-martin-doctolib (Member) commented Feb 26, 2024

> Did you restart pg between each run? It's weird to see that we read data from disk only in the first run ("BatchOver").

Forget what I wrote, I understood my mistake ;)


backfill_batch_size = SafePgMigrations.config.backfill_batch_size

if ActiveRecord.version >= Gem::Version.new('7.1')

Until the patch on Rails is provided/approved/merged/released, I think it's too early to switch to the Rails implementation. WDYT?

@maximerety (Contributor, Author) replied:

Agreed, let's keep this PR in draft and keep the new small optim from #138.

maximerety marked this pull request as draft on February 26, 2024 17:02
Starting from ActiveRecord 7.1, there's a built-in helper equivalent
to what BatchOver does; let's use it instead of maintaining our own
implementation forever.

We keep BatchOver for compatibility with ActiveRecord < 7.1.