dpypeline


Program for creating data pipelines triggered by file creation events.

Version

0.1.0-beta.4

Python environment setup

To utilise this package, install it within a dedicated Conda environment. You can create this environment using the following command:

conda create --name <environment_name> python=3.10

To activate the Conda environment, use:

conda activate <environment_name>

Alternatively, use venv to set up and activate the environment:

python -m venv <environment_name>
source <environment_name>/bin/activate

Installation

  1. Clone the repository:
git clone git@github.com:NOC-OI/dpypeline.git
  2. Navigate to the package directory:

After cloning the repository, navigate to the root directory of the package.

  3. Install in editable mode:

To install dpypeline in editable mode, execute the following command from the root directory:

pip install -e .

This command installs the library in editable mode, allowing you to make changes to the code if needed.

  4. Alternative installation methods:
  • Install from the GitHub repository directly:
pip install git+https://github.com/NOC-OI/dpypeline.git@main#egg=dpypeline
  • Install from the PyPI repository:
pip install dpypeline
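
Whichever method you choose, you can verify the installation by asking pip to locate the package and report its version:

pip show dpypeline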

Unit tests

Run the tests using pytest from the root directory of the repository:

pip install pytest
pytest

Examples

Python scripts

Example Python scripts demonstrating how to use this package can be found in the examples directory.

Command line interface (CLI)

The CLI provided by this package allows you to execute data pipelines defined in YAML files; however, it offers less flexibility than the Python scripts. To run the dpypeline CLI, use a command such as:

dpypeline -i <input_file> > output 2> errors

Flags description

  • -h or --help: Show a help message.
  • -i INPUT_FILE or --input INPUT_FILE: Filepath to the pipeline YAML file (default: pipeline.yaml).
  • -v or --version: Show dpypeline's version number.

Environment variables

The following environment variables need to be set for the application to run correctly:

  • CACHE_DIR: Path to the cache directory.
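
For example, in a Unix-like shell (the path is a placeholder):

export CACHE_DIR=/path/to/cache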

Software Workflow Overview

Pipeline architectures

(Diagram: dpypeline pipeline architectures)

Thread-based pipeline

In the thread-based pipeline, Akita enqueues events into an in-memory queue. These events are subsequently consumed by ConsumerSerial, which generates jobs for sequential execution within the ThreadPipeline (an alias for BasicPipeline).
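
As a rough illustration of this pattern, here is a minimal standard-library sketch; it is not dpypeline's actual API, and all names in it are hypothetical. A watcher thread stands in for Akita and a sequential consumer stands in for ConsumerSerial:

import queue
import threading

def watcher(event_queue):
    # Stands in for Akita: enqueue file-creation events as they arrive.
    for path in ["a.nc", "b.nc"]:
        event_queue.put(path)
    event_queue.put(None)  # Sentinel signalling that no more events will arrive.

def consumer(event_queue, tasks):
    # Stands in for ConsumerSerial: drain the queue and run the
    # pipeline tasks sequentially for each event.
    while True:
        event = event_queue.get()
        if event is None:
            break
        for task in tasks:
            event = task(event)

event_queue = queue.Queue()  # The in-memory event queue.
tasks = [lambda path: f"processed {path}", print]  # Toy pipeline stages.
threading.Thread(target=watcher, args=(event_queue,)).start()
consumer(event_queue, tasks)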

Parallel pipeline

In the parallel pipeline, Akita enqueues events into an in-memory queue. These events are then consumed by ConsumerParallel, which generates futures that are executed concurrently by multiple Dask workers.
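
A comparable sketch of the futures-based pattern, calling dask.distributed directly (assumed to be installed) rather than going through ConsumerParallel; the function and event names are hypothetical:

from dask.distributed import Client

def process(path):
    # A toy pipeline task applied to each event.
    return f"processed {path}"

if __name__ == "__main__":
    client = Client()  # Start a local Dask cluster with several workers.
    events = ["a.nc", "b.nc", "c.nc"]  # Events that Akita would enqueue.
    futures = [client.submit(process, event) for event in events]  # One future per event.
    print(client.gather(futures))  # Block until all futures complete and collect results.
    client.close()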