Adding a New Database

External databases are used extensively in Baleen: to load documents for processing (in collection readers), to load processing information (e.g. gazetteers in annotators), or to save data (in consumers). They can also be used for support services such as logging, history and metrics. Baleen has built-in database support for Mongo, Elasticsearch and Postgres, but developers may want to add support for additional databases.

Additional databases are implemented as BaleenResources. As shared resources, they are available throughout a pipeline, which gives developers options for connection pooling and resource management to avoid overloading remote database servers. See Developing SharedResources for further information.

A typical pattern for implementing a database as a BaleenResource will be to form a database connection in doInitialize() and close that connection in doDestroy().
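The lifecycle described above can be sketched as follows. This is a minimal illustration, not Baleen's actual API: FakeConnection, SharedExampleDbResource and the parameter keys example.host and example.port are hypothetical names, and the method signatures are simplified so that the example runs without Baleen or UIMA on the classpath. In a real resource the class would extend BaleenResource and the connection would come from the database's own driver.

```java
import java.util.Map;

// Hypothetical stand-in for a database client; a real driver would open a network connection here.
final class FakeConnection implements AutoCloseable {
    private final String host;
    private final int port;
    private boolean open = true;

    FakeConnection(String host, int port) {
        this.host = host;
        this.port = port;
    }

    String describe() {
        return host + ":" + port;
    }

    boolean isOpen() {
        return open;
    }

    @Override
    public void close() {
        open = false;
    }
}

// Simplified stand-in for the resource lifecycle (assumption: signatures reduced for illustration).
final class SharedExampleDbResource {
    private FakeConnection connection;

    // Connection settings arrive as parameters rather than being hard coded.
    boolean doInitialize(Map<String, Object> params) {
        String host = (String) params.getOrDefault("example.host", "localhost");
        int port = (Integer) params.getOrDefault("example.port", 5432);
        connection = new FakeConnection(host, port);
        return true;
    }

    // Shared accessor used by readers, annotators and consumers in the pipeline.
    FakeConnection getConnection() {
        return connection;
    }

    // Releases the connection when the pipeline shuts down.
    void doDestroy() {
        if (connection != null) {
            connection.close();
            connection = null;
        }
    }
}
```

Keeping the connection inside the resource means every component in the pipeline shares one connection (or one pool), which is the point at which pooling limits can be applied.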

Specific uses of databases in Baleen

We now consider how to implement resources in the different use cases. In each case the database should be injected into the pipeline as a BaleenResource.

In all cases thought should be given to storage and query patterns, and appropriate indices created on the database to maintain pipeline throughput. Developers should avoid hard coding any configuration for connecting to databases.

Collection reader

When extending BaleenCollectionReader, the database should be polled in doHasNext(). If new data exists to process, it should be stored in the instance and the corresponding item marked as 'in processing' in (or even deleted from) the database. This supports clustering (multiple instances of Baleen polling the same database) by preventing two instances of Baleen from processing the same data item. This 'get and change' should be performed within a single transaction (if the database supports transactions).

The retrieved item (that is temporarily stored in the instance) should be processed and returned through doGetNext().
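The poll-and-claim flow can be sketched as below. This is an illustration under stated assumptions: WorkTable is a hypothetical in-memory stand-in for a database table, PollingReader is a simplified reader that does not extend BaleenCollectionReader, and doGetNext() returns the item id instead of populating a JCas. The atomic replace call plays the role that a single transaction (or an atomic find-and-modify operation) would play in a real database.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical in-memory stand-in for a database table of work items (id -> status).
final class WorkTable {
    static final String PENDING = "pending";
    static final String IN_PROCESSING = "in processing";

    private final Map<String, String> status = new ConcurrentHashMap<>();

    void add(String id) {
        status.put(id, PENDING);
    }

    String statusOf(String id) {
        return status.get(id);
    }

    // Atomically claims one pending item, mimicking a single 'get and change' transaction.
    Optional<String> claimNext() {
        for (String id : status.keySet()) {
            // replace(key, old, new) succeeds only if no other reader changed the status first.
            if (status.replace(id, PENDING, IN_PROCESSING)) {
                return Optional.of(id);
            }
        }
        return Optional.empty();
    }
}

// Simplified reader (assumption: real readers extend BaleenCollectionReader and fill a JCas).
final class PollingReader {
    private final WorkTable table;
    private String claimed; // item held in the instance between doHasNext() and doGetNext()

    PollingReader(WorkTable table) {
        this.table = table;
    }

    // Polls the table and claims an item if one is available.
    boolean doHasNext() {
        if (claimed == null) {
            claimed = table.claimNext().orElse(null);
        }
        return claimed != null;
    }

    // Hands over the claimed item and clears the instance state.
    String doGetNext() {
        if (claimed == null) {
            throw new IllegalStateException("doHasNext() must return true first");
        }
        String id = claimed;
        claimed = null;
        return id;
    }
}
```

With two readers sharing one table, only the reader whose replace succeeds sees the item; the other finds nothing pending and reports no data.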

Annotators

Use of a database in an annotator will be very dependent on that annotator. Thought should be given to whether to cache the database's content in memory to avoid querying the database for each document. If data is cached then it should be periodically refreshed (or ideally refreshed on change).
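One way to cache database content with a periodic refresh is sketched below. CachedGazetteer, its loader and its injectable clock are hypothetical names introduced for illustration; the loader stands in for whatever query the annotator would run against the database. Refresh-on-change would need a notification mechanism from the database and is not shown.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Caches gazetteer terms in memory and reloads them once they exceed a maximum age.
final class CachedGazetteer {
    private final Supplier<Set<String>> loader; // hypothetical database query
    private final LongSupplier clockMillis;     // injectable clock, e.g. System::currentTimeMillis
    private final long maxAgeMillis;

    private Set<String> terms;
    private long loadedAt;

    CachedGazetteer(Supplier<Set<String>> loader, LongSupplier clockMillis, long maxAgeMillis) {
        this.loader = loader;
        this.clockMillis = clockMillis;
        this.maxAgeMillis = maxAgeMillis;
    }

    // Only queries the database when the cache is missing or stale.
    synchronized boolean contains(String term) {
        long now = clockMillis.getAsLong();
        if (terms == null || now - loadedAt >= maxAgeMillis) {
            terms = new HashSet<>(loader.get()); // defensive copy of the loaded terms
            loadedAt = now;
        }
        return terms.contains(term);
    }
}
```

The trade-off is staleness: a term added to the database is not seen until the cache next expires, so the maximum age should reflect how quickly changes need to take effect.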

Consumers

Consumers will write the results of the pipeline (a UIMA JCas object) into the database. What they store, and the format in which they store it, should match the business requirements for onward processing of the data.

Thought should be given to:

Whether existing documents should be replaced if they have the same externalId (i.e. a document has been reprocessed)
How history information should be stored
Batching of data (for fast and high volume pipelines)
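Two of these considerations, replacement by externalId and batching, can be sketched together. DocumentStore and BatchingConsumer are hypothetical names; the store is an in-memory stand-in for a database, and documents are reduced to strings rather than being built from a JCas. The sketch assumes that replacing an existing document is the desired behaviour, which may not suit every deployment.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical in-memory stand-in for a document store keyed by externalId.
final class DocumentStore {
    private final Map<String, String> documents = new LinkedHashMap<>();
    private int writeCalls;

    // One call represents one round trip to the database, however many documents it carries.
    void upsertAll(Map<String, String> batch) {
        documents.putAll(batch); // an existing externalId is overwritten
        writeCalls++;
    }

    String get(String externalId) {
        return documents.get(externalId);
    }

    int size() {
        return documents.size();
    }

    int writeCalls() {
        return writeCalls;
    }
}

// Buffers documents and writes them in batches to reduce round trips.
final class BatchingConsumer {
    private final DocumentStore store;
    private final int batchSize;
    private final Map<String, String> pending = new LinkedHashMap<>();

    BatchingConsumer(DocumentStore store, int batchSize) {
        this.store = store;
        this.batchSize = batchSize;
    }

    void consume(String externalId, String content) {
        pending.put(externalId, content); // the same externalId replaces the buffered copy
        if (pending.size() >= batchSize) {
            flush();
        }
    }

    // Should also be called on shutdown so that a partly filled batch is not lost.
    void flush() {
        if (pending.isEmpty()) {
            return;
        }
        store.upsertAll(pending);
        pending.clear();
    }
}
```

Batching trades latency and durability for throughput: documents held in the buffer are not yet visible in the database and would be lost if the process stopped before a flush.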

History

History implementations should implement DocumentHistory and BaleenHistory.

For a typical database, AbstractBaleenHistory and AbstractBaleenDocumentHistory provide starting points for implementation, with AbstractCachingBaleenHistory offering the ability to manage a local in-memory cache while a document is being processed through the pipeline.

Where possible, database changes should be saved live (that is, saved to the database as they happen). This not only reduces the amount of state held in memory but also allows better diagnosis of errors.

Any functions which get items from the database (getHistory() on DocumentHistory) should be fast and should retrieve only the data they need.
