Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve emulator performance for large projects #294

Open
ohaibbq opened this issue Apr 9, 2024 · 7 comments
Open

Improve emulator performance for large projects #294

ohaibbq opened this issue Apr 9, 2024 · 7 comments
Labels
enhancement New feature or request

Comments

@ohaibbq
Copy link
Contributor

ohaibbq commented Apr 9, 2024

What would you like to be added?

Hi @goccy @totem3, I'm filing this ticket as an epic to outline a few shortcomings we're running into.
I've undertaken work to address these issues and am nearing completion.

Our primary use-case facing these issues involves replacing a build that loads our view graph to BigQuery for testing. The graph currently encompasses ~2,500 tables and ~3,600 views, and ~3,000 tables materialized from views.

Outline of Issues

1. Project metadata middleware

Entire BigQuery project is loaded per request. When the project has thousands of tables, views, jobs, each request becomes very slow. The API data-access pattern should be changed to access only the data that pertains to a request.


2. SQLite always uses file-backed storage

In my tests, utilizing the memory-backed SQLite storage was nearly 3x faster. We should add an option to allow the emulator to run using in-memory storage.


3. go-zetasqlite usage leads to inefficient SQLite query plans

For complex BigQuery queries this is understandable, but for some heavily repeated queries that are known ahead of time (i.e. metadata.Repository.FindProjects()), we should try to avoid this.

This problem compounds when doing simple primary-key based lookups. If we were to utilize SQLite's CREATE TABLE ... WITHOUT ROWID functionality, simple lookups essentially become hash lookups. Without it, SQLite cannot predict the equality of values, so it must scan the entire table.

In order for SQLite to use its hash-based primary key lookup, we'd need to be using the native SQLite = operator.
go-zetasqlite rewrites these calls to use zetasqlite_equal(), which is unnecessary for the metadata repository.


4. Re-used repository queries do not use prepared statements


5. --data-from-yaml YAML parser is exceptionally slow

We have a script to populate our ~2,500 source table definitions into a data file to bootstrap the emulator.
Parsing this file takes many minutes when using YAML.

Parsing a JSON file with the same contents only takes ~75ms.

An alternative --data-from-json parameter to the binary.

@totem3
Copy link
Sponsor Collaborator

totem3 commented Apr 11, 2024

Hi @ohaibbq,
Thank you for this summary and your work for improvement.
Since we also face performance problems, and the issues you summarized match what I found, I am excited to hear you are working on these performance issues.
I will try the Pull Requests you submitted.

@ohaibbq
Copy link
Contributor Author

ohaibbq commented Apr 11, 2024

Great. I will tag you in the PR that achieves the performance improvements.
I'm still piecing out the prerequisite PRs in go-zetasqlite.

From my observed benchmarks, we were able to load 2,500 tables into the emulator at startup via --data-from-json data.json --database :memory: in 8 seconds.
We are also able to load and materialize 1,400 views in 35 seconds.

@ohaibbq
Copy link
Contributor Author

ohaibbq commented Apr 12, 2024

@totem3 It will be a little while until the PR dependencies are merged into their respective @goccy repositories, but you should be able to build an emulator binary using this branch of our fork-

Recidiviz#12

@ohaibbq
Copy link
Contributor Author

ohaibbq commented Apr 12, 2024

@totem3 @goccy This is ready for your review if you'd like to take a look for more context on how the other PRs I opened up fit into the puzzle.
Recidiviz#12

@ohaibbq
Copy link
Contributor Author

ohaibbq commented Apr 12, 2024

I finally got the chance to compare runtimes now that all my PRs had been sorted out. A single request to the emulator takes ~18 seconds when the project has all of our 2,500 source tables added to it.

Now most all requests are sub 15ms.

@totem3
Copy link
Sponsor Collaborator

totem3 commented Apr 13, 2024

@ohaibbq
Thank you for sharing the branch! It made verification much easier for me.
I haven't looked at the changes in detail yet, but I've tried it.
I am excited at how much faster the tests have become.

We mainly use the BigQuery emulator for testing. Previously, tests took over 15 minutes, but with this version, they finish in less than a minute. The tests are somewhat unstable, so they need to be checked. It’s possible that tests are failing due to concurrency issues because of the increased speed, so I am looking into that.

@ohaibbq
Copy link
Contributor Author

ohaibbq commented Apr 13, 2024

That's great to hear! The impact you are seeing is likely mostly due to the API data access refactor.

Our fork has only slightly diverged from the upstream repositories, some other notable performance improvements that may be improving your test times are here:
Recidiviz/go-zetasqlite#32
Recidiviz/go-zetasqlite#20

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants