James Baker edited this page Apr 27, 2017 · 2 revisions

Since Baleen 2.2, a Baleen Jobs framework has been available to allow users to run tasks outside of a standard Baleen pipeline. Example use cases include running a task over the whole corpus, gathering statistics, or performing clean-up operations (such as deleting temporary files). Tasks can be run as a one-off occurrence, or can be configured to run on a schedule (such as every 12 hours).

Jobs, each consisting of one or more tasks, can be configured through YAML configuration files or through the REST API. This guide covers configuration through YAML files; the same file content can also be submitted via the REST API.

Configuring a Job

Jobs can be configured through YAML configuration in a similar manner to Baleen Pipelines. A job contains at most one schedule object (comparable to a collection reader) and a list of tasks (comparable to annotators). If no schedule is provided, the default schedule is Once. Tasks are always run in the order specified. As with pipelines, global configuration can also be provided.

mongo:
  db: example

schedule:
  class: FixedDelay
  period: 300
tasks:
  - MongoStats
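
Because the schedule object is optional, a job file can consist of just the global configuration and a task list. The following is a minimal sketch of such a file, reusing the MongoStats task and the example database name from above; with no schedule given, the job would fall back to the Once schedule and run a single time.

# No schedule block, so the default (Once) applies
mongo:
  db: example

tasks:
  - MongoStats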

The following schedules are available:

  • FixedDelay - Run the job x seconds after the previous job completes, where x is specified by the period parameter
  • FixedRate - Run the job x seconds after the previous job starts (assuming it has completed), where x is specified by the period parameter
  • Once - Run the job a single time (default)
  • Repeat - Run the job x number of times with a delay of y seconds after the previous job completes, where x and y are specified by the count and period parameters respectively
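
As an illustration of the Repeat schedule, the sketch below uses the count and period parameters named in the list above. The values are illustrative only: the period is given in seconds, so 43200 corresponds to a 12-hour delay between runs.

# Illustrative values; count and period are the parameters described above
schedule:
  class: Repeat
  count: 3        # run the job three times in total
  period: 43200   # wait 12 hours (12 x 60 x 60 seconds) after each run completes
tasks:
  - MongoStats

Replacing the class with FixedDelay or FixedRate and dropping count would give an open-ended schedule with the same period.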

Adding a Job to Baleen

Jobs can be added to the Baleen configuration in the same way pipelines can be, although they use a jobs object rather than a pipelines one.

jobs:
  - file: Example_Job.yml
    name: Example Job
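
A Baleen configuration that runs both a pipeline and a job might then look something like the sketch below. The pipeline entry is an assumption based on the statement that jobs are added in the same way as pipelines, and the pipeline name and file name are hypothetical.

# Sketch only: the pipelines entry is assumed to mirror the jobs entry
pipelines:
  - file: Example_Pipeline.yml   # hypothetical file name
    name: Example Pipeline       # hypothetical pipeline name

jobs:
  - file: Example_Job.yml
    name: Example Job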

Versions prior to Baleen 2.4

Note that the format described above is correct as of Baleen 2.4. In previous versions, the Job YAML file required an additional job block wrapping the schedule and tasks, for example:

job:
  schedule:
    class: FixedDelay
    period: 300
  tasks:
    - MongoStats