In session 1 we covered:
- Setting up and ingesting a source database (Cloud SQL Postgres)
- Loading source data to a data warehouse (BigQuery) using Airbyte
- Transforming the raw source data in the data warehouse with dbt so it can be used for analytical use cases
- We've looked at the following dbt concepts:
- General setup
- Defining sources and creating data models
- Running dbt and using different model materializations
- Writing tests and documentation
In this session we will continue with dbt by covering more advanced topics:
- Working with packages
- Writing custom tests
- Writing and using macros
- Using incremental materialization
- Using snapshots
Besides this, we will also go through an example of how we can use one of the analytical datasets we've created in a dashboard.
Although this is not part of this session, in reality there are of course many more ways these tables can be used, and from a data engineer's perspective one of the most important things to figure out is how to handle access and monitoring.
Similar to most programming languages, e.g. Python, dbt uses the concept of packages to share third-party code, generally dbt macros, which makes it easy to add useful functionality to your dbt project. These packages are hosted on the Packages Hub.
In the exercise below we will install a package that gives us more generic tests to use.
We will install the dbt_utils package. It is maintained by dbt-labs and adds some generally useful macros.
- In the dbt project's root, i.e. the directory with dbt_project.yml, create a file packages.yml
  - Here we will define all our (package) dependencies
- Add this to packages.yml:

  ```yaml
  packages:
    - package: dbt-labs/dbt_utils
      version: 1.0.0
  ```
- To install this package run dbt deps from the dbt project's root
- Try to come up with relevant tests using dbt_utils defined macros. Here are some test suggestions (a sketch of what these could look like follows this exercise):
  - Use not_empty_string to verify categories.category_name is not an empty string.
  - Use accepted_range to verify order_details.discount has a lower bound of 0 (inclusive) and an upper bound of 1 (inclusive)
  - Use unique_combination_of_columns to verify the combination of order_id and product_id uniquely identifies a row in order_details

  NOTE: These could be added on the sources directly in staging/_sources.yml or on the staging models in staging/_models.yml
- Run dbt test to make sure all tests pass
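A sketch of how these suggested tests could look in staging/_models.yml; the staging model names (stg_categories, stg_order_details) follow the session 1 naming convention and are assumptions about your project:

```yaml
version: 2

models:
  - name: stg_categories        # assumed staging model name
    columns:
      - name: category_name
        tests:
          - dbt_utils.not_empty_string
  - name: stg_order_details     # assumed staging model name
    tests:
      - dbt_utils.unique_combination_of_columns:
          combination_of_columns:
            - order_id
            - product_id
    columns:
      - name: discount
        tests:
          - dbt_utils.accepted_range:
              min_value: 0
              max_value: 1
              inclusive: true
```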
Supplementary exercise: Have a look at the dbt-expectations package and use one of its macros for a test.
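For instance, one option is dbt_expectations.expect_column_values_to_be_between. Add the package to packages.yml (it is published as calogica/dbt_expectations on the package hub), run dbt deps again, and then use it like any other generic test. The model, column, and bound below are only an illustration:

```yaml
models:
  - name: stg_order_details     # assumed staging model name
    columns:
      - name: quantity
        tests:
          - dbt_expectations.expect_column_values_to_be_between:
              min_value: 1
```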
In session 1 (and the above exercise) we used predefined tests. Now we will write our own custom tests.
Note that it is also possible to write custom singular tests. These are one-off assertions usable for a single purpose.
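For illustration, a singular test is just a select statement saved as a .sql file under tests/; it fails when the query returns any rows. A minimal sketch (the file name and the freight rule are hypothetical; stg_orders is assumed to exist from session 1):

```sql
-- tests/assert_no_negative_freight.sql (hypothetical singular test)
-- fails if any order has a negative freight cost
select
    order_id,
    freight
from {{ ref('stg_orders') }}
where freight < 0
```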
We will write a generic, reusable test. Instead of erroring, this specific test will warn when it fails.
- Create the file tests/generic/warn_if_greater_than.sql
- Write a test that will warn if a numeric value is greater than the provided value

  Solution:

  ```sql
  {% test warn_if_greater_than(model, column_name, max_value=1000) %}

  {{ config(severity = 'warn') }}

  select {{ column_name }}
  from {{ model }}
  where {{ column_name }} > {{ max_value }}

  {% endtest %}
  ```
NOTE: dbt tests are select statements which fail if the select statement returns a non-zero number of rows.
- Try it out:
  - Warn if orders.freight is greater than 999

    NOTE: This test can be added on the source directly in staging/_sources.yml or on the staging model in staging/_models.yml

    Solution:

    ```yaml
    version: 2

    sources:
      - name: <YOUR_UNIQUE_PREFIX>_northwind_raw
        tables:
          - name: orders
            columns:
              - name: freight
                tests:
                  - warn_if_greater_than:
                      max_value: 999
    ```
- Run dbt test to verify the warning gets output
- It is also possible to override the default 'error' severity of predefined tests to 'warn'. This requires adding additional configuration, see the docs.
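For example, a built-in test such as not_null can be configured to warn instead of error; a minimal sketch (the column is just an example):

```yaml
columns:
  - name: freight
    tests:
      - not_null:
          config:
            severity: warn
```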
We have already used several macros throughout the previous exercises. Now we will write and use our own custom macros.
- Create the file macros/human_formatted_currency_amount.sql
- Create the file macros/_macros.yml
- Write a macro that formats a numeric value as a dollar currency string, e.g. $300,499.50 or $9.00

  Solution:

  ```sql
  {% macro to_dollar_currency(column_name) %}
  concat("$", format("%'.2f", cast({{ column_name }} as float64)))
  {% endmacro %}
  ```
- Document the macro in macros/_macros.yml
  - Describe the macro and its parameters
  - This will also be visible in the generated docs (dbt docs generate)

  Solution:

  ```yaml
  version: 2

  macros:
    - name: to_dollar_currency
      description: A macro to convert a numeric value to a dollar currency formatted string
      arguments:
        - name: column_name
          type: string
          description: The name of the column you want to convert
  ```
- Use the macro to create a new column in employee_sales that contains the dollar formatted total sales

  Solution:

  ```sql
  ...
  employee_sales.total_sale_amount,
  {{ to_dollar_currency('employee_sales.total_sale_amount') }} as total_sale_dollar_fmt,
  ...
  ```
- Jinja templates (which include macros) are evaluated at compile time.
  - To see what the macro evaluates to, run dbt compile -s models/marts/employee_sales.sql
  - And look at the compiled output (target/compiled/models/marts/employee_sales.sql)
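For illustration, the relevant part of the compiled output should contain the expanded macro body, roughly like this (exact whitespace and surrounding columns depend on your model):

```sql
employee_sales.total_sale_amount,
concat("$", format("%'.2f", cast(employee_sales.total_sale_amount as float64))) as total_sale_dollar_fmt,
```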
Traditional data models run code against the entire dataset each time they are executed, overwriting previous results. Incremental models, on the other hand, run code against the entire dataset only once during the initial run and then only execute on new data in subsequent runs.
Read about pros/cons of different materializations here
- Simulate adding additional rows by inserting a few samples into the orders raw table.

  NOTE: generally these rows would be added to the actual source (in our case Postgres); however, as an example it is easier to add new rows directly to the raw table on BigQuery
- Execute the following insert statements from the BigQuery GCP console:
  ```sql
  INSERT INTO modern-data-stack-training.<YOUR_UNIQUE_PREFIX>_northwind_raw.orders (freight, order_id, ship_via, ship_city, ship_name, order_date, customer_id, employee_id, ship_region, ship_address, ship_country, shipped_date, required_date, ship_postal_code, _airbyte_ab_id, _airbyte_emitted_at, _airbyte_normalized_at, _airbyte_orders_hashid)
  VALUES (10.0, 15335, 1, 'Amsterdam', 'Pieter van der Linden', '2022-01-01', 'CONSH', 1, 'Noord', '123 Hoofdstraat', 'NL', '2022-02-01', '2022-03-01', '1000 AB', 'ab_id_1', '2022-01-01 12:00:00', '2022-01-01 12:00:00', 'hash_id_1');

  INSERT INTO modern-data-stack-training.<YOUR_UNIQUE_PREFIX>_northwind_raw.orders (freight, order_id, ship_via, ship_city, ship_name, order_date, customer_id, employee_id, ship_region, ship_address, ship_country, shipped_date, required_date, ship_postal_code, _airbyte_ab_id, _airbyte_emitted_at, _airbyte_normalized_at, _airbyte_orders_hashid)
  VALUES (20.0, 15472, 2, 'Rotterdam', 'Sophie de Boer', '2022-02-01', 'NORTS', 2, 'West', '456 Keizersgrachtstraat', 'NL', '2022-03-01', '2022-04-01', '3000 AB', 'ab_id_2', '2022-02-01 12:00:00', '2022-02-01 12:00:00', 'hash_id_2');

  INSERT INTO modern-data-stack-training.<YOUR_UNIQUE_PREFIX>_northwind_raw.orders (freight, order_id, ship_via, ship_city, ship_name, order_date, customer_id, employee_id, ship_region, ship_address, ship_country, shipped_date, required_date, ship_postal_code, _airbyte_ab_id, _airbyte_emitted_at, _airbyte_normalized_at, _airbyte_orders_hashid)
  VALUES (15.0, 10462, 3, 'Amsterdam', 'Thomas Jansen', '2022-03-01', 'NORTS', 3, 'Midden', '789 Bloemstraatweg', 'NL', '2022-04-01', '2022-05-01', '1000 AB', 'ab_id_3', '2022-03-01 12:00:00', '2022-03-01 12:00:00', 'hash_id_3');

  INSERT INTO modern-data-stack-training.<YOUR_UNIQUE_PREFIX>_northwind_raw.orders (freight, order_id, ship_via, ship_city, ship_name, order_date, customer_id, employee_id, ship_region, ship_address, ship_country, shipped_date, required_date, ship_postal_code, _airbyte_ab_id, _airbyte_emitted_at, _airbyte_normalized_at, _airbyte_orders_hashid)
  VALUES (12.0, 19472, 1, 'Rotterdam', 'Marieke van den Berg', '2022-04-01', 'DUMON', 4, 'West', '246 Vrijburglaan', 'NL', '2022-05-01', '2022-06-01', '3001 CD', 'ab_id_4', '2022-04-01 12:00:00', '2022-04-01 12:00:00', 'hash_id_4');

  INSERT INTO modern-data-stack-training.<YOUR_UNIQUE_PREFIX>_northwind_raw.orders (freight, order_id, ship_via, ship_city, ship_name, order_date, customer_id, employee_id, ship_region, ship_address, ship_country, shipped_date, required_date, ship_postal_code, _airbyte_ab_id, _airbyte_emitted_at, _airbyte_normalized_at, _airbyte_orders_hashid)
  VALUES (8.0, 14450, 2, 'Den Haag', 'Jan Pieters', '2022-05-01', 'BLONP', 5, 'Zuid', '369 Damstraatje', 'NL', '2022-06-01', '2022-07-01', '2501 AB', 'ab_id_5', '2022-05-01 12:00:00', '2022-05-01 12:00:00', 'hash_id_5');
  ```
- Change the materialization of the stg_orders model to incremental.
- Utilize the is_incremental() macro to filter for new rows only (see the sketch after this exercise).
- [Optional:] Configure the materialization with a unique_key='order_id' to allow for modification of existing rows rather than only adding new ones.
- Run dbt run -s stg_orders
  - Only the added order rows should have been processed
  - The number of rows processed should be visible in the logs
- [Optional:] Run the same model with --full-refresh to force rebuilding the table from scratch.
NOTE: for more on the complexity incremental models introduce, see: https://www.getdbt.com/coalesce-2021/trials-and-tribulations-of-incremental-models/
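A minimal sketch of what an incremental stg_orders could look like, assuming it selects from the orders source defined earlier; the column list and the _airbyte_emitted_at filter are illustrative and should be adapted to your actual staging model:

```sql
{{
    config(
        materialized='incremental',
        unique_key='order_id'
    )
}}

select
    order_id,
    customer_id,
    employee_id,
    order_date,
    freight,
    _airbyte_emitted_at
from {{ source('<YOUR_UNIQUE_PREFIX>_northwind_raw', 'orders') }}

{% if is_incremental() %}
-- on incremental runs, only process rows that are newer than what is already in the table
where _airbyte_emitted_at > (select max(_airbyte_emitted_at) from {{ this }})
{% endif %}
```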
dbt snapshots record changes made to a mutable table over time, enabling point-in-time queries. This feature implements type 2 slowly changing dimensions (SCD2) and tracks the validity of each row with "from" and "to" date columns.
- Create the file snapshots/employees.sql
- Update the content of snapshots/employees.sql with the snapshot code, starting with {% snapshot employees_snapshot %}
  - Configure the snapshot with:
    - target_schema='<YOUR_UNIQUE_PREFIX>_dbt'
    - strategy='check' (Read more about it here)
    - unique_key='employee_id'
    - check_cols=['address', 'postal_code', 'home_phone', 'title_of_courtesy']
  - In the body, select from the employees source (as done in the staging models)
  - A sketch of a complete snapshot file follows this exercise.
- Run the command dbt snapshot to create the snapshot table
- Examine the created table employees_snapshot; you will observe that dbt has added the columns "dbt_valid_from" and "dbt_valid_to", with the latter set to null. Future executions will modify this.
- Update one of the existing employee records in the raw table:

  ```sql
  UPDATE `<YOUR_UNIQUE_PREFIX>_northwind_raw.employees`
  SET address = '3D Herengrachtstraat', home_phone = '06-12345678'
  WHERE employee_id = 7;
  ```
- Re-run dbt snapshot to capture the changes
  - Check out employees_snapshot for the changes (related to employee_id = 7)
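A minimal sketch of what snapshots/employees.sql could look like with the configuration above (the source name follows the earlier exercises; adjust it to your project):

```sql
{% snapshot employees_snapshot %}

{{
    config(
        target_schema='<YOUR_UNIQUE_PREFIX>_dbt',
        unique_key='employee_id',
        strategy='check',
        check_cols=['address', 'postal_code', 'home_phone', 'title_of_courtesy']
    )
}}

select * from {{ source('<YOUR_UNIQUE_PREFIX>_northwind_raw', 'employees') }}

{% endsnapshot %}
```

Once the snapshot table exists, a point-in-time query could look like this (picking the row version of employee 7 that was valid on a given date; the date is just an example):

```sql
SELECT *
FROM `<YOUR_UNIQUE_PREFIX>_dbt.employees_snapshot`
WHERE employee_id = 7
  AND dbt_valid_from <= TIMESTAMP('2023-01-01')
  AND (dbt_valid_to > TIMESTAMP('2023-01-01') OR dbt_valid_to IS NULL)
```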
Having transformed the data, it's now time to convert it into informative and engaging reports. To achieve this, we will utilize Looker Studio, formerly known as Google Data Studio. This platform offers a user-friendly interface for crafting interactive dashboards, charts, and reports, providing a seamless experience for data visualization and analysis.
Read more about Looker Studio features here.
- Create a new report
- Add the following analytical tables from modern-data-stack-training.<YOUR_UNIQUE_PREFIX>_dbt as data sources:
  - category_sales
  - employee_sales
  - orders_per_month
- After all three have been added they should be visible in the right-hand "Data" sidebar.
- Some chart suggestions:
  - Create a bar chart for category_sales
  - Create a bar chart for employee_sales
  - Create a time series chart for orders_per_month
    - First add an order_year_month field to orders_per_month as a date type
      - Once added, select Date & Time -> Year Month from the dropdown of that field's type on the overview
      - Press Done
    - On the chart add a Time Series Filter on order_year to only select orders before the year 2000 (to remove outliers)
- What is the modern data stack?
- Understanding the Modern Data Stack
- What’s So Modern About the Modern Data Stack?
- Is ETL Still Relevant?
- dbt best practices
- dbt fundamentals free courses
- dbt discourse
- Jinja & Macros for modular and cleaner SQL Queries Part 1 Part 2