black-cape/cast-iron-worker

Cast-Iron Worker Example

Getting Started

The Cast-Iron Worker example leverages several Python libraries to accomplish the ETL process.

Installing Dependencies

  • Install Python 3.8, preferably using Pyenv
$ pyenv install
  • This project uses Poetry to manage Python dependencies.
$ curl -sSL https://raw.githubusercontent.com/python-poetry/poetry/master/get-poetry.py | python
  • Install dependencies
$ poetry install

Start the Worker

  1. Add 127.0.0.1 kafka entry to your /etc/hosts file
  2. Start the Cast-Iron ETL worker
    • Locally
    $ poetry shell
    $ faust -A python_worker.etl worker -l info
    
    • To run with a debugger use python worker.py
    • Docker
    $ docker-compose up --build
    

Utilize the ETL

With the Docker containers running and the worker running either in a container or locally:

  1. Navigate to MinIO http://localhost:9000
  2. Add example_config.toml to the etl bucket
  3. Refresh the page to verify that additional etl buckets are created
  4. Navigate into 01_inbox
  5. Add data/data_test.tsv
  6. The TSV should be ETL-ed
  7. The TSV moves to the archive_dir bucket

Matching files to processors

The processor config handled_file_glob configures file extension pattern matching. The matchers should be provided as e.g. _test.tsv|_updated.csv|.mp3 (no spaces).
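One way such a pattern could be applied is simple suffix matching on the |-separated entries. This is a minimal sketch (the helper name and exact semantics are assumptions; the worker's actual implementation may differ):

```python
def matches_file_glob(handled_file_glob: str, filename: str) -> bool:
    """Return True if the filename ends with any of the |-separated suffixes."""
    suffixes = handled_file_glob.split("|")
    return any(filename.endswith(suffix) for suffix in suffixes)

print(matches_file_glob("_test.tsv|_updated.csv|.mp3", "data_test.tsv"))  # True
print(matches_file_glob("_test.tsv|_updated.csv|.mp3", "report.pdf"))     # False
```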

The processor config handled_mimetypes specifies Tika mimetypes for a processor to match. Its value should be a comma-separated string of mimetypes, e.g. application/pdf,application/vnd.openxmlformats-officedocument.wordprocessingml.document

  • Note: in order to enable Tika mimetype matching, the environment setting ENABLE_TIKA must be set to a truthy value. See the Settings section below for details about environment settings.

Files are matched to processors as follows: for a single file, checks are made based on processor configurations, one processor at a time.

  • The first processor that is found to match the file is used to process the file, and the rest are ignored.
    • So if two processors could have each matched a file, the order in which the processors are checked determines which matches and which is ignored.
  • One or the other, or both, of handled_mimetypes and handled_file_glob can be specified for a processor.
    • If both are specified, mimetype checking is tried first, then the file extension glob if the mimetype check fails or returns False for that processor.
    • Each processor will check both mimetype and file extension glob matching before moving on to the next processor.
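Putting these rules together, the first-match-wins selection could be sketched roughly like this (the processor structure and function names are illustrative assumptions, not the worker's actual code):

```python
def pick_processor(processors, filename, detected_mimetype, tika_enabled):
    """Return the first processor config that matches the file, or None."""
    for proc in processors:
        # Mimetype matching is tried first, when configured and Tika is enabled.
        mimetypes = proc.get("handled_mimetypes")
        if tika_enabled and mimetypes and detected_mimetype in mimetypes.split(","):
            return proc
        # Fall back to the file-extension glob before moving to the next processor.
        glob = proc.get("handled_file_glob")
        if glob and any(filename.endswith(s) for s in glob.split("|")):
            return proc
    return None

processors = [
    {"name": "pdf", "handled_mimetypes": "application/pdf"},
    {"name": "tsv", "handled_file_glob": "_test.tsv|_updated.csv"},
]
print(pick_processor(processors, "data_test.tsv", "text/tab-separated-values", True)["name"])  # tsv
```

Because the loop returns on the first match, the order of processors decides which one wins when several could match the same file.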

Technology

Toml

Toml is used to create configuration files that can be used to tell the worker how to ETL a given file.

An example configuration file can be seen in the example_config.toml and the example_python_config.toml.

MinIO

Several buckets are used as stages in the ETL process. These buckets are defined in the TOML config file:

  • inbox_dir
  • processing_dir
  • archive_dir
  • error_dir
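As an illustration, a config along these lines might define the stage buckets and matching rules. The bucket values other than 01_inbox and the overall layout are assumptions; see example_config.toml in the repository for the authoritative format:

```toml
inbox_dir = "01_inbox"
processing_dir = "02_processing"
archive_dir = "03_archive"
error_dir = "04_error"

handled_file_glob = "_test.tsv|_updated.csv"
handled_mimetypes = "text/tab-separated-values,text/csv"
```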

Settings

The Settings class allows environment variables or a .env file to supply the appropriate arguments, thanks to its inheritance from Pydantic's BaseSettings class. Additional details on how BaseSettings works can be found in the Pydantic documentation.
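For illustration, the effect of a setting such as ENABLE_TIKA being "set to a truthy value" can be approximated with the standard library alone. This is a stand-in sketch, not the worker's actual Settings class (which inherits from Pydantic's BaseSettings, and whose exact truthiness rules may differ):

```python
import os

def enable_tika() -> bool:
    """Read ENABLE_TIKA from the environment and interpret it as a boolean."""
    value = os.environ.get("ENABLE_TIKA", "false")
    return value.strip().lower() in {"1", "true", "yes", "on"}

os.environ["ENABLE_TIKA"] = "true"
print(enable_tika())  # True
```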