James Baker edited this page Apr 27, 2017 · 2 revisions

This page provides a quick overview of Baleen and its configuration options. For full documentation, see the JavaDoc included within Baleen.

Running Baleen

To start the Baleen server, run the following command from the home directory of Baleen: java -jar baleen-2.x.0.jar

Note that the name of the JAR file will vary depending on the version of Baleen that you are running. This applies to all references of JAR files in this document.

To access the JavaDoc on your server, baleen-2.x.0-javadoc.jar must be in the same directory as the Baleen JAR. If this file isn't present, the JavaDoc will not be accessible.

Once Baleen is running, you can access the server at http://localhost:6413.
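
To confirm from the command line that the server has come up, a quick check along the following lines may help. It is only a sketch: it assumes curl is installed and that Baleen is listening on the default address given above, and the function name check_baleen is purely illustrative rather than part of Baleen.

```shell
#!/bin/sh
# Hypothetical helper (not part of Baleen): reports whether anything answers
# HTTP at the given URL. Defaults to the address mentioned above.
check_baleen() {
  url="${1:-http://localhost:6413}"
  # -s: silent, -o /dev/null: discard the response body,
  # --max-time 5: give up after five seconds instead of hanging
  if curl -s -o /dev/null --max-time 5 "$url"; then
    echo "Baleen is reachable at $url"
  else
    echo "Nothing is responding at $url"
    return 1
  fi
}
```

Calling check_baleen with no arguments checks the default address; pass a different URL as the first argument if your server is running elsewhere.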

Configuring Baleen

Baleen is configured via a YAML file that is passed to the server on startup. The YAML file can be called whatever you want, but in this document we will refer to it as config.yml.

To pass the YAML file to Baleen, use the following command: java -jar baleen-2.x.0.jar config.yml

The configuration file contains a number of different sections that refer to different aspects of Baleen, such as the server configuration, default pipelines and logging. These are described in the JavaDoc, and listed on the page for uk.gov.dstl.baleen.runner.Baleen.

Sensible defaults are assumed where an explicit setting is not provided in the configuration.

An example configuration is provided below. Note that spaces should be used rather than tabs.

logging:
  loggers:
    - name: console
      minLevel: INFO
      excludeLoggers:
      - org.eclipse.jetty
      - org.apache.pdfbox
    - name: errors.log
      minLevel: WARN

pipelines:
  - name: Test Pipeline
    file: test_pipeline.yaml
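
Because tabs are easy to introduce by accident, it can be worth checking a configuration file for them before starting Baleen. The following is a minimal sketch of such a check; the function name find_tabs is made up for illustration, and it flags a tab anywhere on a line, not only in the indentation.

```shell
#!/bin/sh
# Hypothetical helper (not part of Baleen): prints every line of the given
# file that contains a tab character, prefixed with its line number.
# Returns 0 if the file is tab-free and 1 otherwise.
find_tabs() {
  tab=$(printf '\t')
  if grep -n "$tab" "$1"; then
    echo "Tabs found in $1; replace them with spaces" >&2
    return 1
  fi
  return 0
}
```

For example, find_tabs config.yml prints nothing and succeeds when the file uses spaces throughout.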

Configuring Pipelines

Pipelines are also configured via YAML files, which can either be specified in the config.yml file (as above) so that they are loaded at startup, or be loaded into Baleen via the Baleen REST API (see the REST API documentation built into Baleen).

Pipeline configuration files are split into sections describing each of the UIMA analysis engine types (e.g. collection readers, annotators and CAS consumers), as well as a section for global variables. The format is described in more detail in the JavaDoc, on the page for uk.gov.dstl.baleen.core.pipelines.PipelineBuilder, and an example is given below:

collectionreader:
  class: FolderReader
  folders:
   - data

annotators:
  - regex.Email
  - regex.Url
  - class: regex.Mgrs
    ignoreDates: true

consumers:
  - Mongo

Pipelines that are specified in config.yml are started automatically when Baleen starts, whereas pipelines created through the REST API must be started explicitly through the REST API. Once running, a pipeline continues to run until the user stops it, and it continually polls the collection reader for new documents. Once the above example pipeline is running, documents can therefore be processed at any time by placing them in a folder that the FolderReader is watching.
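
As a small illustration of that last point, the following sketch drops a text file into the data folder used by the example pipeline. The file name and its contents are made up for illustration; the email address and URL are included so that the regex.Email and regex.Url annotators have something to match.

```shell
#!/bin/sh
# Create the folder read by the FolderReader in the example pipeline.
# The path is relative to the directory this script is run from.
mkdir -p data

# Write a made-up sample document containing an email address and a URL.
printf '%s\n' \
  'Please contact jane.doe@example.com for details.' \
  'Further information is at http://www.example.com/info.' \
  > data/sample.txt
```

Run this from the same directory you start Baleen from, so that the relative data path in the pipeline file points at the folder created here.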

Running the Example

To run the above example, use the following command. If Baleen is still running from earlier, stop it first by pressing Ctrl+C in the terminal.

java -jar baleen-2.x.0.jar config.yml

This will process any documents in the data folder (relative to your current working directory) and write the output to a Mongo database. You will need a local instance of Mongo running on port 27017 (the default); plenty of information on how to set that up is available online.

To look at the entities Baleen has found, use the following commands in the Mongo shell:

use baleen
db.entities.find().pretty()