BigQuery Optimization via Anti-Pattern Recognition

This utility scans a BigQuery SQL in search for several possible anti-patterns. Anti-patterns are specific SQL syntaxes that in some cases might cause performance impact.

We recommend using this tool to scan the top 10% slot consuming jobs of your workload. Addressing these anti-patterns in most cases will provide performance significant benefits.

Example of tool input and output :

Input:

SELECT 
    * 
FROM 
    `project.dataset.table1`

Output:

All columns on table: project.dataset.table1 are being selected. Please be sure that all columns are needed

Quick Start

To run the tool use the cloud shell terminal. It has all the pre-requisites.

Build utility

# in cloud shell terminal
git clone https://github.com/GoogleCloudPlatform/bigquery-antipattern-recognition.git
cd bigquery-antipattern-recognition
mvn clean package jib:dockerBuild -DskipTests

Run tool for simple inline query

# in cloud shell terminal
docker run \
  -i bigquery-antipattern-recognition \
  --query "SELECT * FROM \`project.dataset.table1\`"

In the BigQuery console, run the DDL bellow to create th output table

-- in BQ console 
CREATE OR REPLACE TABLE <my-project>.<my-dataset>.antipattern_output_table (
  job_id STRING,
  user_email STRING,
  query STRING,
  recommendation ARRAY<STRUCT<name STRING, description STRING>>,
  slot_hours FLOAT64,
  optimized_sql STRING,
  process_timestamp TIMESTAMP
);

To read from INFORMATION_SCHEMA and write to the output table, run the following in the command line:

# in cloud shell terminal
gcloud auth login

docker run \
    -v ~/.config:/root/.config \
    -i bigquery-antipattern-recognition \
    --read_from_info_schema \
    --info_schema_project <project-name-with-info-schema-to-be-read> \
    --info_schema_region us \
    --read_from_info_schema_days 1 \
    --processing_project_id <my-processing-project> \
    --output_table "<my-project>.<my-dataset>.antipattern_output_table" \
    --info_schema_top_n_percentage_of_jobs 0.1

Read output in BigQuery Console

-- in BQ console
SELECT
  job_id, user_email, query, 
  recommendation, slot_hours
FROM 
  `<my-project>.<my-dataset>.antipattern_output_table`
ORDER BY
  process_timestamp DESC 
LIMIT 10000;

Other input / output options

Rewrite Using AI

This tool has an AI based feature that will output the rewritten query.
This functionality is activated using the --rewrite_sql flag.
(This feature is available in all outputs: logs, files, BigQuery table).

Prerequisite: enable Vertex AI API

Input

docker run \
    -v ~/.config:/root/.config \
    -i bigquery-antipattern-recognition \
    --query "select col1 from table1 where col2 like '%abc%' and col3=1" \
    --rewrite_sql \
    --processing_project_id <my-project>

Output

--------------------------------------------------
Recommendations for query: query provided by cli:
* WhereOrder: LIKE filter in line 1 precedes a more selective filter.
* Optimized query:
select col1 from table1 where col3=1 and col2 like '%abc%'
--------------------------------------------------

Deploy to Cloud Run Jobs

Deploying the anti-pattern recognition tool to a Cloud Run job provides an easy mechanism to periodically scan INFORMATION_SCHEMA for occurrences of anti-patterns in top slot consuming queries.

Refer to this step-by-step on how to deploy this tool using Cloud Run.

Dataform and DBT

See the examples page for information on extracting SQL from Dataform and DBT

Flags and arguments

Specify Input

To read inline query

--query="SELECT ... FROM ..."

To parse SQL string provided via CLI.

To read from INFORMATION_SCHEMA

--read_from_info_schema

To read input queries from INFORMATION_SCHEMA.JOBS.

--read_from_info_schema_days n

--info_schema_project <project-name>

--info_schema_region us

Region from which to read information schema

--read_from_info_schema_start_time "start-timestamp"
--read_from_info_schema_end_time "end-timestamp"

--read_from_info_schema_timeout_in_secs n

--info_schema_table_name "\`<my-project>.region-us\`.INFORMATION_SCHEMA.JOBS" \

Specifies what variant of INFORMATION_SCHEMA.JOBS to read from.

--info_schema_top_n_percentage_of_jobs n

Number between 0 and 1. Uses to specify what fraction of top slot consuming jobs the tool should consider, e.g. if equal to 0.1 only top 10% slot consuming jobs will be checked por anti patterns.

To read from a BigQuery Table

--input_bq_table project_id.dataset_id.table_name

To read from a files

--input_file_path /path/to/file.sql

Specifies path to file with SQL string to be parsed. Can be local file or GCS file.

--input_folder_path /path/to/folder/with/sql/files

--input_csv_file_path /path/to/input/file.csv

Specify output

--output_file_path /path/to/output/file.csv

--output_table "my-project.dataset.antipattern_output_table"

Specifies table to which write results to. Assumes that the table already exits.

Specify compute project

--processing_project_id <my-processing-project>

Using AI for rewrite

--rewrite_sql

Anti patterns

Anti Pattern 1: Selecting all columns

Example:

SELECT 
    * 
FROM 
    `project.dataset.table1`

Output:

All columns on table: project.dataset.table1 are being selected. Please be sure that all columns are needed

Anti Pattern 2: SEMI-JOIN without aggregation

Example:

SELECT 
   t1.col1 
FROM 
   `project.dataset.table1` t1 
WHERE 
    t1.col2 not in (select col2 from `project.dataset.table2`);

Output:

You are using an IN filter with a subquery without a DISTINCT on the following columns: project.dataset.table1.col2

Anti Pattern 3: Multiple CTEs referenced more than twice

Example:

WITH
  a AS (
  SELECT col1,col2 FROM test WHERE col1='abc' 
  ),
  b AS ( 
    SELECT col2 FROM a 
  ),
  c AS (
  SELECT col1 FROM a 
  )
SELECT
  b.col2,
  c.col1
FROM
  b,c;

Output:

CTE with multiple references: alias a defined at line 2 is referenced 2 times

Anti Pattern 4: Using ORDER BY without LIMIT

Example:

SELECT
  t.dim1,
  t.dim2,
  t.metric1
FROM
  `dataset.table` t
ORDER BY
  t.metric1 DESC;

Output:

ORDER BY clause without LIMIT at line 8.

Anti Pattern 5: Using REGEXP_CONTAINS when LIKE is an option

Example:

SELECT
  dim1
FROM
  `dataset.table`
WHERE
  REGEXP_CONTAINS(dim1, ‘.*test.*’)

Output:

REGEXP_CONTAINS at line 6. Prefer LIKE when the full power of regex is not needed (e.g. wildcard matching).";

Anti Pattern 6: Using an analytic functions to determine latest record

Example:

SELECT
  taxi_id, trip_seconds, fare
FROM
  (
  SELECT
    taxi_id, trip_seconds, fare,
    row_number() over(partition by taxi_id order by fare desc) rn
  FROM
    `bigquery-public-data.chicago_taxi_trips.taxi_trips`
)
WHERE
  rn = 1;

Output:

LatestRecordWithAnalyticFun: Seems like you might be using analytical function row_number in line 7 to filter the latest record in line 12.

Anti Pattern 7: Convert Dynamic Predicates into Static

Example:

SELECT
 *
FROM 
  comments c
JOIN 
  users u ON c.user_id = u.id
WHERE 
  u.id IN (
    SELECT id 
    FROM users
    WHERE location LIKE '%New York'
    GROUP BY id
    ORDER BY SUM(up_votes) DESC
    LIMIT 10
  )
;

Output:

Dynamic Predicate: Using subquery in filter at line 10. Converting this dynamic predicate to static might provide better performance.

Anti Pattern 8: Where order, apply most selective expression first

Example:

SELECT 
  repo_name, 
  id,
  ref
FROM 
  `bigquery-public-data.github_repos.files` 
WHERE
  ref like '%master%'
  and repo_name = 'cdnjs/cdnjs'
;

Output:

WhereOrder: LIKE filter in line 8 precedes a more selective filter.

Anti Pattern 9: Join Order

As a best practice the table with the largest number of rows should be placed first in a JOIN.

This anti-pattern checks the join order based on the number of rows of each table. To do so this tool must fetch table metadata, for which the advanced_analysis flag must be used.

Details can be found here.

Example:

SELECT  
  t1.station_id,
  COUNT(1) num_trips_started
FROM
  `bigquery-public-data.austin_bikeshare.bikeshare_stations` t1
JOIN
  `bigquery-public-data.austin_bikeshare.bikeshare_trips` t2 ON t1.station_id = t2.start_station_id
GROUP BY
  t1.station_id
;

Output:

JoinOrder: JOIN on tables: [bikeshare_stations, bikeshare_trips] might perform 
better if tables where joined in the following order: 
[bikeshare_trips, bikeshare_stations]

License

Copyright 2024 Google Inc.

Licensed under the Apache License, Version 2.0 (the "License"); you may not
use this file except in compliance with the License. You may obtain a copy of
the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
License for the specific language governing permissions and limitations under
the License.

Name		Name	Last commit message	Last commit date
Latest commit History 181 Commits
.github/workflows		.github/workflows
bigquery-antipattern-recognition		bigquery-antipattern-recognition
cicd		cicd
samples		samples
terraform		terraform
.gcloudignore		.gcloudignore
.gitignore		.gitignore
CONTRIBUTING.md		CONTRIBUTING.md
CR_DEPLOY.md		CR_DEPLOY.md
EXAMPLES.md		EXAMPLES.md
JAVA_LICENSE_HEADER		JAVA_LICENSE_HEADER
LICENSE		LICENSE
README.md		README.md
cloudbuild.yaml		cloudbuild.yaml
pom.xml		pom.xml

License

GoogleCloudPlatform/bigquery-antipattern-recognition

Folders and files

Latest commit

History

Repository files navigation

BigQuery Optimization via Anti-Pattern Recognition

Quick Start

Other input / output options

Rewrite Using AI

Deploy to Cloud Run Jobs

Dataform and DBT

Flags and arguments

Specify Input

To read inline query

To read from INFORMATION_SCHEMA

To read from a BigQuery Table

To read from a files

Specify output

Specify compute project

Using AI for rewrite

Anti patterns

Anti Pattern 1: Selecting all columns

Anti Pattern 2: SEMI-JOIN without aggregation

Anti Pattern 3: Multiple CTEs referenced more than twice

Anti Pattern 4: Using ORDER BY without LIMIT

Anti Pattern 5: Using REGEXP_CONTAINS when LIKE is an option

Anti Pattern 6: Using an analytic functions to determine latest record

Anti Pattern 7: Convert Dynamic Predicates into Static

Anti Pattern 8: Where order, apply most selective expression first

Anti Pattern 9: Join Order

License

About

Resources

License

Stars

Watchers

Forks

Languages