Improve emulator performance for large projects #294

ohaibbq · 2024-04-09T22:15:26Z

What would you like to be added?

Hi @goccy @totem3, I'm filing this ticket as an epic to outline a few shortcomings we're running into.
I've undertaken work to address these issues and am nearing completion.

Our primary use case affected by these issues is replacing a build that loads our view graph into BigQuery for testing. The graph currently encompasses ~2,500 tables, ~3,600 views, and ~3,000 tables materialized from views.

Outline of Issues

1. Project metadata middleware

The entire BigQuery project is loaded on every request. When the project has thousands of tables, views, and jobs, each request becomes very slow. The API data-access pattern should be changed to access only the data that pertains to a request.
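
To make the difference concrete, here is a minimal sketch in Python using the standard `sqlite3` module. It is not the emulator's Go code: the `tables` schema and both function names are hypothetical, chosen only to contrast reading everything with reading the one row a request names.

```python
import json
import sqlite3

# Hypothetical metadata schema, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tables (dataset_id TEXT, id TEXT, metadata TEXT, "
    "PRIMARY KEY (dataset_id, id))"
)
conn.executemany(
    "INSERT INTO tables VALUES (?, ?, ?)",
    [("dataset1", f"table_{i}", json.dumps({"n": i})) for i in range(2500)],
)


def load_whole_project(conn):
    """Per-request full load: read every table regardless of what is needed."""
    rows = conn.execute("SELECT dataset_id, id, metadata FROM tables").fetchall()
    return {(dataset_id, table_id): meta for dataset_id, table_id, meta in rows}


def find_table(conn, dataset_id, table_id):
    """Targeted access: read only the row the request refers to."""
    row = conn.execute(
        "SELECT metadata FROM tables WHERE dataset_id = ? AND id = ?",
        (dataset_id, table_id),
    ).fetchone()
    return row[0] if row else None


everything = load_whole_project(conn)
one = find_table(conn, "dataset1", "table_42")
```

Both calls can answer a question about `table_42`, but the first materializes all 2,500 rows to do so, and that cost is paid again on every request.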

2. SQLite always uses file-backed storage

In my tests, utilizing the memory-backed SQLite storage was nearly 3x faster. We should add an option to allow the emulator to run using in-memory storage.
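
For reference, SQLite selects in-memory storage when the database name is the special string `:memory:`. The sketch below (Python's standard `sqlite3` module, with a hypothetical `tables` schema and helper names) runs the same code against both kinds of storage. It makes no timing claim; any speedup depends on the workload and disk.

```python
import os
import sqlite3
import tempfile


def open_store(database):
    """Open a SQLite database and create a small hypothetical schema."""
    conn = sqlite3.connect(database)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tables (id TEXT PRIMARY KEY, metadata TEXT)"
    )
    return conn


def populate(conn, count):
    """Insert `count` rows in one transaction and return the row count."""
    with conn:  # commits once when the block exits
        conn.executemany(
            "INSERT INTO tables VALUES (?, ?)",
            [(f"table_{i}", "{}") for i in range(count)],
        )
    return conn.execute("SELECT COUNT(*) FROM tables").fetchone()[0]


# In-memory storage: nothing is written to disk.
memory_conn = open_store(":memory:")
memory_count = populate(memory_conn, 1000)

# File-backed storage: identical code, only the database name differs.
with tempfile.TemporaryDirectory() as tmp:
    file_conn = open_store(os.path.join(tmp, "emulator.db"))
    file_count = populate(file_conn, 1000)
    file_conn.close()
```

Because only the database name changes, exposing the choice as an option should not require touching the query code.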

3. `go-zetasqlite` usage leads to inefficient SQLite query plans

For complex BigQuery queries this overhead is understandable, but for heavily repeated queries that are known ahead of time (e.g. metadata.Repository.FindProjects()), we should try to avoid it.

This problem compounds when doing simple primary-key lookups. If we were to use SQLite's CREATE TABLE ... WITHOUT ROWID functionality, the table is stored as a B-tree keyed on the primary key, so a simple lookup becomes a direct key search. When the comparison is wrapped in a function call, SQLite cannot use the key to narrow the search, so it must scan the entire table.

For SQLite to use its primary-key lookup, we'd need to be using the native SQLite = operator.
go-zetasqlite rewrites these comparisons to use zetasqlite_equal(), which is unnecessary for the metadata repository.
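
The effect can be observed with `EXPLAIN QUERY PLAN`. The following is a small sketch in Python's standard `sqlite3` module, not the emulator's code: the `tables` schema is hypothetical, and `stand_in_equal` is a made-up user-defined function that stands in for a rewritten comparison such as zetasqlite_equal().

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tables (id TEXT PRIMARY KEY, metadata TEXT) WITHOUT ROWID"
)
conn.executemany(
    "INSERT INTO tables VALUES (?, ?)",
    [(f"table_{i}", "{}") for i in range(100)],
)

# Hypothetical stand-in for a function-based equality check.
conn.create_function("stand_in_equal", 2, lambda a, b: int(a == b))


def plan(sql):
    """Return the detail column of EXPLAIN QUERY PLAN as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, ("table_42",)).fetchall()
    return "; ".join(row[3] for row in rows)


native_sql = "SELECT metadata FROM tables WHERE id = ?"
wrapped_sql = "SELECT metadata FROM tables WHERE stand_in_equal(id, ?)"

native_plan = plan(native_sql)    # expected to be a SEARCH on the primary key
wrapped_plan = plan(wrapped_sql)  # expected to be a full SCAN

print(native_plan)
print(wrapped_plan)
```

The exact wording of the plan varies between SQLite versions, but the native `=` query should report a SEARCH using the primary key, while the function-wrapped query should report a SCAN of the whole table. Both return the same row.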

4. Re-used repository queries do not use prepared statements

5. `--data-from-yaml` YAML parser is exceptionally slow

We have a script to populate our ~2,500 source table definitions into a data file to bootstrap the emulator.
Parsing this file takes many minutes when using YAML.

Parsing a JSON file with the same contents only takes ~75ms.

We should add an alternative --data-from-json parameter to the binary.
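
As a rough sketch of what such a file could contain, the Python snippet below builds and parses a bootstrap document with the standard `json` module. The layout (`projects` → `datasets` → `tables` → `columns`) is an assumption that the JSON would mirror the existing YAML data file; the helper names and the `test` / `dataset1` identifiers are hypothetical.

```python
import json


def build_data_file(table_count):
    """Build a bootstrap document; the layout is assumed to mirror the YAML file."""
    tables = [
        {"id": f"table_{i}", "columns": [{"name": "id", "type": "INTEGER"}]}
        for i in range(table_count)
    ]
    return {
        "projects": [
            {"id": "test", "datasets": [{"id": "dataset1", "tables": tables}]}
        ]
    }


def count_tables(document):
    """Count table definitions across all projects and datasets."""
    return sum(
        len(dataset.get("tables", []))
        for project in document["projects"]
        for dataset in project.get("datasets", [])
    )


encoded = json.dumps(build_data_file(2500))  # what a generator script might write
decoded = json.loads(encoded)                # what the emulator would read back
table_total = count_tables(decoded)
```

Since JSON is also valid YAML, a generator script that emits JSON can keep producing a single file for either flag.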


totem3 · 2024-04-11T00:26:56Z

Hi @ohaibbq,
Thank you for this summary and for your work on these improvements.
Since we also face performance problems, and the issues you summarized match what I found, I am excited to hear you are working on these performance issues.
I will try the Pull Requests you submitted.

ohaibbq · 2024-04-11T00:58:24Z

Great. I will tag you in the PR that achieves the performance improvements.
I'm still piecing out the prerequisite PRs in go-zetasqlite.

From my observed benchmarks, we were able to load 2,500 tables into the emulator at startup via --data-from-json data.json --database :memory: in 8 seconds.
We are also able to load and materialize 1,400 views in 35 seconds.

ohaibbq · 2024-04-12T00:13:47Z

@totem3 It will be a little while until the PR dependencies are merged into their respective @goccy repositories, but you should be able to build an emulator binary using this branch of our fork:

Recidiviz#12

ohaibbq · 2024-04-12T18:28:15Z

@totem3 @goccy This is ready for your review if you'd like to take a look for more context on how the other PRs I opened up fit into the puzzle.
Recidiviz#12

ohaibbq · 2024-04-12T23:11:18Z

I finally got the chance to compare runtimes now that all my PRs have been sorted out. Before these changes, a single request to the emulator took ~18 seconds when the project had all of our 2,500 source tables added to it.

Now nearly all requests complete in under 15 ms.

totem3 · 2024-04-13T06:59:31Z

@ohaibbq
Thank you for sharing the branch! It made verification much easier for me.
I haven't looked at the changes in detail yet, but I've tried it.
I am excited at how much faster the tests have become.

We mainly use the BigQuery emulator for testing. Previously, tests took over 15 minutes, but with this version, they finish in less than a minute. The tests are somewhat unstable, so they need to be checked. It’s possible that tests are failing due to concurrency issues because of the increased speed, so I am looking into that.

ohaibbq · 2024-04-13T18:40:56Z

That's great to hear! The impact you are seeing is likely mostly due to the API data access refactor.

Our fork has only slightly diverged from the upstream repositories. Some other notable performance improvements that may be improving your test times are here:
Recidiviz/go-zetasqlite#32
Recidiviz/go-zetasqlite#20

ohaibbq added the enhancement New feature or request label Apr 9, 2024

ohaibbq mentioned this issue Apr 12, 2024

Refactor API data-access pattern to only load what is necessary; use prepared statements Recidiviz/bigquery-emulator#12

Merged
