This application takes the following arguments:
- `--csv-location`

  The full path to the CSV that is used as a lookup table for the application.

  db | table | pii |
  ---|---|---|
  db1 | tab1 | false |
  db2 | tab2 | true |
  db3 | tab3 | |

  The application checks the contents of a table that follows the above structure to tag S3 objects (tables) correctly.
- `--data-bucket`

  The name of the bucket that holds the dataset to be tagged.

- `--data-s3-prefix`

  The S3 prefix (can be partial) that the application crawls to find objects (tables) to tag with the values from the CSV.

  Example: `Data_product_output/2021-01-28/`. The application crawls any objects it finds under that prefix.

  Partial example: `Data_product_output/2021-01-28/database_one/examp`. The application crawls all prefixes that start with `examp`. It finds all objects in `example_one` and `example_two`, if they exist.

- `--log-level`

  Optional argument. The default is `INFO`.
- `--environment`

  The environment the app is being run in, e.g. Development.

- `--application`

  The name to give to the application. This shows up in the logs.
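The partial-prefix behaviour described for `--data-s3-prefix` can be illustrated with a minimal sketch. It assumes plain string-prefix matching on object keys, which is how S3 prefix listing filters keys; the helper `keys_under_prefix` and the `part-0000` object names are hypothetical and not taken from the application.

```python
def keys_under_prefix(keys, prefix):
    """Return the keys that begin with `prefix` (plain string-prefix match)."""
    return [key for key in keys if key.startswith(prefix)]


# Hypothetical object keys, used only for illustration.
keys = [
    "Data_product_output/2021-01-28/database_one/example_one/part-0000",
    "Data_product_output/2021-01-28/database_one/example_two/part-0000",
    "Data_product_output/2021-01-28/database_one/other/part-0000",
]

# The partial prefix ends mid-name, so it matches both example_one and example_two.
matched = keys_under_prefix(
    keys, "Data_product_output/2021-01-28/database_one/examp"
)
for key in matched:
    print(key)
```

Because the match is on the raw key string, a partial prefix needs no trailing slash and selects every table whose name starts with the given fragment.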
The following environment variables are required. Their values are replaced by the parameters passed in through the arguments above.
Variable name | Example | Description |
---|---|---|
csv_location | s3://bucket/example/csv_file.csv | The full path to the CSV |
data_bucket | NOT_SET | Bucket name |
data_s3_prefix | NOT_SET | Prefix to crawl |
log_level | INFO | The desired log level, INFO or DEBUG |
environment | NOT_SET | The environment the app runs in. e.g. Development |
application | NOT_SET | The name of the application |
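One way the relationship between the arguments and the environment variables could be wired up is sketched below. It is an assumption, not the application's actual code: each flag takes its default from the environment variable of the same name, so a value passed on the command line replaces it. The function `parse_args` and its `env` parameter are hypothetical.

```python
import argparse
import os

# Variable names as listed in the table above.
VARIABLES = [
    "csv_location",
    "data_bucket",
    "data_s3_prefix",
    "log_level",
    "environment",
    "application",
]


def parse_args(argv=None, env=None):
    """Parse the flags, using environment variables as defaults."""
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser()
    for name in VARIABLES:
        flag = "--" + name.replace("_", "-")
        # log_level falls back to INFO; the others have no built-in default here.
        fallback = "INFO" if name == "log_level" else None
        parser.add_argument(flag, default=env.get(name, fallback))
    return parser.parse_args(argv)
```

For example, `parse_args(["--data-bucket", "my-bucket"], env={"data_bucket": "other"})` would yield `my-bucket`, since the command-line value takes precedence over the environment variable.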
Some assumptions are made about the structure of the S3 objects and the data in the CSV:
- S3 structure

  The structure of objects in the prefix is expected to look like this: `folder_name/db_name.db/table_name/<objects-to-tag>`

  `<objects-to-tag>` are one or more part files of `table_name` that make up the table.

- CSV database names

  The output from the data products creates databases with a `.db` suffix. When the object tagger runs, it tags the S3 objects with the database name without the `.db` suffix. The lookup CSV is expected NOT to have the `.db` suffix in its database names.
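The two assumptions above can be sketched together as follows. This is only an illustration under stated assumptions: the CSV is taken to have a header row with `db`, `table` and `pii` columns, and the helpers `load_lookup` and `db_and_table`, as well as the `part-0000` object name, are hypothetical.

```python
import csv
import io
from pathlib import PurePosixPath

# Assumed CSV layout: a header row followed by one row per table.
CSV_TEXT = "db,table,pii\ndb1,tab1,false\ndb2,tab2,true\ndb3,tab3,\n"


def load_lookup(csv_text):
    """Map (db, table) to the pii value read from the CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {(row["db"], row["table"]): row["pii"] for row in reader}


def db_and_table(key):
    """Extract (db_name, table_name) from folder_name/db_name.db/table_name/<object>.

    The `.db` suffix is stripped so the name matches the CSV.
    Returns None when the key does not follow the expected structure.
    """
    parts = PurePosixPath(key).parts
    # The database part must be followed by a table part and an object name.
    for index, part in enumerate(parts[:-2]):
        if part.endswith(".db"):
            return part[: -len(".db")], parts[index + 1]
    return None


lookup = load_lookup(CSV_TEXT)
name = db_and_table("folder_name/db2.db/tab2/part-0000")
print(name, lookup.get(name))
```

Stripping the suffix before the lookup is what lets a key under `db2.db` match the `db2` row in the CSV.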
The application is deployed to DockerHub, after which it is mirrored to AWS ECR.
After cloning this repo, please run `make bootstrap`.