Improve emulator performance for large projects #294
Comments
Hi @ohaibbq,
Great. I will tag you in the PR that achieves the performance improvements. From my observed benchmarks, we were able to load 2,500 tables into the emulator at startup via
@totem3 @goccy This is ready for your review if you'd like to take a look for more context on how the other PRs I opened up fit into the puzzle.
I finally got the chance to compare runtimes now that all my PRs have been sorted out. A single request to the emulator took ~18 seconds when the project had all of our 2,500 source tables added to it. Now almost all requests are sub-15ms.
@ohaibbq We mainly use the BigQuery emulator for testing. Previously, tests took over 15 minutes, but with this version, they finish in less than a minute. The tests are somewhat unstable, so they need to be checked. It's possible that tests are failing due to concurrency issues because of the increased speed, so I am looking into that.
That's great to hear! The impact you are seeing is likely mostly due to the API data-access refactor. Our fork has only slightly diverged from the upstream repository; some other notable performance improvements that may be improving your test times are here:
What would you like to be added?
Hi @goccy @totem3, I'm filing this ticket as an epic to outline a few shortcomings we're running into.
I've undertaken work to address these issues and am nearing completion.
The primary use case where we face these issues is replacing a build that loads our view graph into BigQuery for testing. The graph currently encompasses ~2,500 source tables, ~3,600 views, and ~3,000 tables materialized from views.
Outline of Issues
1. Project metadata middleware
The entire BigQuery project is loaded per request. When the project has thousands of tables, views, and jobs, each request becomes very slow. The API data-access pattern should be changed to access only the data that pertains to a given request.
2. SQLite always uses file-backed storage
In my tests, using memory-backed SQLite storage was nearly 3x faster. We should add an option to run the emulator with in-memory storage.
3. `go-zetasqlite` usage leads to inefficient SQLite query plans
For complex BigQuery queries this is understandable, but for some heavily repeated queries that are known ahead of time (e.g. `metadata.Repository.FindProjects()`), we should try to avoid this.
This problem compounds when doing simple primary-key lookups. If we were to utilize SQLite's `CREATE TABLE ... WITHOUT ROWID` functionality, simple lookups essentially become hash lookups. Without it, SQLite cannot predict the equality of values, so it must scan the entire table.
In order for SQLite to use its primary-key lookup, we'd need to be using the native SQLite `=` operator. `go-zetasqlite` rewrites these calls to use `zetasqlite_equal()`, which is unnecessary for the metadata repository.
4. Re-used repository queries do not use prepared statements
5. `--data-from-yaml` YAML parser is exceptionally slow
We have a script to populate our ~2,500 source table definitions into a data file to bootstrap the emulator.
Parsing this file takes many minutes when using YAML.
Parsing a JSON file with the same contents takes only ~75ms.
An alternative `--data-from-json` parameter to the binary would solve this.