The Cast-Iron Worker example leverages several Python libraries to accomplish the ETL process.
- Install Python 3.8, preferably using Pyenv
$ pyenv install
- This project uses Poetry to manage Python dependencies.
$ curl -sSL https://raw.githubusercontent.com/python-poetry/poetry/master/get-poetry.py | python
- Install dependencies
$ poetry install
- Add the following entry to your /etc/hosts file
127.0.0.1 kafka
- Start the Cast-Iron ETL worker
- Locally
$ poetry shell
$ faust -A python_worker.etl worker -l info
- To run with a debugger use
python worker.py
- Docker
$ docker-compose up --build
With the Docker containers running and the worker running either in a container or locally:
- Navigate to MinIO
http://localhost:9000
- Add example_config.toml to the etl bucket
- Refresh the page to verify that additional etl buckets are created
- Navigate into 01_inbox
- Add data/data_test.tsv
- TSV should be ETL-ed
- TSV moves to the archive_dir bucket
The processor config handled_file_glob configures file extension pattern matching. The matchers should be provided as a pipe-separated string with no spaces, e.g. _test.tsv|_updated.csv|.mp3
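As a rough illustration of the format, the sketch below splits a handled_file_glob value on the pipe character and treats each alternative as a filename suffix. The helper name matches_file_glob and the suffix-matching rule are assumptions made for this example; the worker's actual matching logic may differ.

```python
def matches_file_glob(filename: str, handled_file_glob: str) -> bool:
    """Return True if filename ends with any of the pipe-separated matchers.

    Hypothetical helper: suffix matching is an assumption for illustration,
    not the worker's confirmed implementation.
    """
    # Split on "|" and drop empty alternatives (e.g. from a trailing pipe).
    matchers = [m for m in handled_file_glob.split("|") if m]
    return any(filename.endswith(m) for m in matchers)


pattern = "_test.tsv|_updated.csv|.mp3"
print(matches_file_glob("data_test.tsv", pattern))  # suffix "_test.tsv" applies
print(matches_file_glob("song.mp3", pattern))       # suffix ".mp3" applies
print(matches_file_glob("notes.txt", pattern))      # no matcher applies
```

In this sketch nothing strips whitespace, so a value written with spaces around the pipes would yield matchers that contain those spaces and would not match as intended.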
The processor config handled_mimetypes specifies Tika mimetypes for a processor to match. Its value should be a comma-separated string of mimetypes, e.g. application/pdf,application/vnd.openxmlformats-officedocument.wordprocessingml.document
- Note: to enable Tika mimetype matching, the environment setting ENABLE_TIKA must be set to a truthy value. See the Settings section below for details about environment settings.
Files are matched to processors as follows: for a single file, each processor's configuration is checked in turn, one processor at a time.
- The first processor that is found to match the file is used to process the file, and the rest are ignored.
- So if two processors could have each matched a file, the order in which the processors are checked determines which matches and which is ignored.
- One or the other, or both, of handled_mimetypes and handled_file_glob can be specified for a processor.
  - If both are specified, mimetype checking is tried first, then file extension glob if mimetype failed or returned False for that processor.
- Each processor will check both mimetype and file extension glob matching before moving on to the next processor.
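The matching rules above can be sketched roughly as follows. The Processor dataclass, the function names, and the suffix-based glob check are all hypothetical stand-ins chosen for this example, not the worker's real classes; a mimetype of None stands for the case where Tika detection is disabled or returned nothing.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Processor:
    """Hypothetical stand-in for one processor entry in the config."""
    name: str
    handled_mimetypes: Optional[str] = None  # comma-separated string
    handled_file_glob: Optional[str] = None  # pipe-separated string


def processor_matches(proc: Processor, filename: str,
                      mimetype: Optional[str]) -> bool:
    # 1. Try the mimetype check first, if configured and a mimetype is known.
    if proc.handled_mimetypes and mimetype:
        if mimetype in proc.handled_mimetypes.split(","):
            return True
    # 2. Fall back to the file glob (suffix matching is an assumption here).
    if proc.handled_file_glob:
        matchers = [m for m in proc.handled_file_glob.split("|") if m]
        return any(filename.endswith(m) for m in matchers)
    return False


def find_processor(processors: List[Processor], filename: str,
                   mimetype: Optional[str] = None) -> Optional[Processor]:
    # Both checks run for one processor before the next is considered;
    # the first processor that matches wins and the rest are ignored.
    for proc in processors:
        if processor_matches(proc, filename, mimetype):
            return proc
    return None


processors = [
    Processor("tsv_only", handled_file_glob="_test.tsv"),
    Processor("any_tsv", handled_mimetypes="text/tab-separated-values",
              handled_file_glob=".tsv"),
]
match = find_processor(processors, "data_test.tsv", "text/tab-separated-values")
print(match.name if match else None)
```

Both hypothetical processors could handle data_test.tsv, so reversing the list changes which one is selected. This is the ordering effect described in the bullets above.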
TOML configuration files tell the worker how to ETL a given file.
Example configuration files can be found in example_config.toml and example_python_config.toml.
Several buckets are used as stages in the ETL process. These buckets are defined in the TOML config file. The buckets are created i
inbox_dir
processing_dir
archive_dir
error_dir
The Settings class allows environment variables or a .env file to supply the appropriate arguments, based on its inheritance of Pydantic's BaseSettings class. Additional details on how BaseSettings works can be found in the Pydantic documentation.
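To make the idea concrete without depending on Pydantic, here is a standard-library stand-in that reads one setting from the environment. It is not the project's Settings class and does not use BaseSettings; the class name SettingsSketch, the field name enable_tika, and the set of strings treated as truthy are assumptions for illustration, and .env file loading is left out.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Assumption: these strings count as truthy. The real parsing rules are
# whatever Pydantic's BaseSettings applies to boolean fields.
_TRUTHY = {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class SettingsSketch:
    """Hypothetical, stdlib-only imitation of an env-driven settings class."""
    enable_tika: bool = False

    @classmethod
    def from_env(cls, environ: Optional[Mapping[str, str]] = None) -> "SettingsSketch":
        # Read from the real environment unless a mapping is passed in,
        # which keeps the example easy to try without exporting variables.
        env = os.environ if environ is None else environ
        raw = env.get("ENABLE_TIKA", "")
        return cls(enable_tika=raw.strip().lower() in _TRUTHY)


print(SettingsSketch.from_env({"ENABLE_TIKA": "true"}))
print(SettingsSketch.from_env({}))
```

With the real Settings class, the equivalent would be to export ENABLE_TIKA in the shell or add it to the .env file before starting the worker.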