From d5ef401903ff0de6d28d6905a3e232389a1679c8 Mon Sep 17 00:00:00 2001
From: Adler Santos
Date: Tue, 10 Aug 2021 13:15:28 -0400
Subject: [PATCH] docs: Add best practices for percentage data and hierarchies (#125)

* docs: Add best practices for percentage data and hierarchies

* Update README.md

* recommend BQ nested columns as a best practice
---
 README.md | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/README.md b/README.md
index 1ead1b08b..5ca59c28b 100644
--- a/README.md
+++ b/README.md
@@ -227,6 +227,21 @@ Every dataset and pipeline folder must contain a `dataset.yaml` and a `pipeline.
 
 # Best Practices
 
+- When your tabular data contains percentage values, represent them as floats between 0 to 1.
+- To represent hierarchical data in BigQuery, use either:
+  - (Recommended) Nested columns in BigQuery. For more info, see [the documentation on nested and repeated columns](https://cloud.google.com/bigquery/docs/nested-repeated).
+  - Or, represent each level as a separate column. For example, if you have the following hierarchy: `chapter > section > subsection`, then represent them as
+
+    ```
+    |chapter          |section|subsection          |page|
+    |-----------------|-------|--------------------|----|
+    |Operating Systems|       |                    |50  |
+    |Operating Systems|Linux  |                    |51  |
+    |Operating Systems|Linux  |The Linux Filesystem|51  |
+    |Operating Systems|Linux  |Users & Groups      |58  |
+    |Operating Systems|Linux  |Distributions       |70  |
+    ```
+
 - When running `scripts/generate_terraform.py`, the argument `--bucket-name-prefix` helps prevent GCS bucket name collisions because bucket names must be globally unique. Use hyphens over underscores for the prefix and make it as unique as possible, and specific to your own environment or use case.
 - When naming BigQuery columns, always use `snake_case` and lowercase.
 - When specifying BigQuery schemas, be explicit and always include `name`, `type` and `mode` for every column. For column descriptions, derive it from the data source's definitions when available.
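
Three of the practices above (percentages as floats from 0 to 1, lowercase `snake_case` column names, and schemas that always carry `name`, `type` and `mode`) can be checked mechanically. The following is a minimal sketch, not code from this repository. The helper names `percent_to_fraction`, `to_snake_case` and `missing_schema_keys` are hypothetical, and rejecting values outside 0 to 1 is an assumption about how strictly the first practice should be enforced.

```python
import re

# The three keys the best practices ask for on every schema column.
REQUIRED_KEYS = ("name", "type", "mode")


def percent_to_fraction(value):
    """Convert a percentage such as 42.5 or "42.5%" to a float in [0, 1]."""
    if isinstance(value, str):
        value = value.strip().rstrip("%")
    fraction = float(value) / 100.0
    # Assumption: values outside 0..1 are treated as errors.
    if not 0.0 <= fraction <= 1.0:
        raise ValueError(f"percentage out of range: {value!r}")
    return fraction


def to_snake_case(name):
    """Return a lowercase snake_case version of a column name."""
    # Insert an underscore at camelCase boundaries (e.g. totalCases).
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name.strip())
    # Collapse every run of non-alphanumeric characters into one underscore.
    name = re.sub(r"[^0-9A-Za-z]+", "_", name)
    return name.strip("_").lower()


def missing_schema_keys(column):
    """List which of name, type and mode are absent from a schema column dict."""
    return [key for key in REQUIRED_KEYS if key not in column]
```

For example, `missing_schema_keys({"name": "page", "type": "INTEGER"})` reports that `mode` is missing, and `to_snake_case("Percent Vaccinated")` yields `percent_vaccinated`.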