What's New in Baleen 2.7.0

James Baker edited this page May 8, 2019 · 4 revisions

Baleen 2.7.0 was released on 8th May 2019. The release notes for v2.7.0 list all the changes in this release; this page gives additional examples and information on the significant changes.

Content Extractors are now first class Baleen objects

This enables content extractors to use Baleen Resources, which was not previously possible.

This is a potentially breaking change for third-party collection readers that specifically depend on a content extractor.

The YAML pipeline syntax has also changed, but in this case backwards compatibility has been maintained, so existing pipeline YAML files should still be valid. For example, the following two YAML files are equivalent and both valid:

New syntax

collectionreader:
  class: FolderReader
  folders: input
  
contentextractor: TikaContentExtractor

Old syntax

collectionreader:
  class: FolderReader
  folders: input
  contentExtractor: TikaContentExtractor

Using the new syntax, content extractor parameters may be specified as follows:

contentextractor:
  class: CsvContentExtractor
  contentColumn: 2
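
As a sketch, assuming the parameterised form sits at the top level of the pipeline file in the same way as the short form does, a complete reader configuration combining the snippets above might look like this (the class names and the contentColumn value are simply reused from the examples on this page):

# Sketch only: assumes the parameterised content extractor is a top-level
# entry alongside the collection reader, as in the short form of the new syntax.
collectionreader:
  class: FolderReader
  folders: input

contentextractor:
  class: CsvContentExtractor
  contentColumn: 2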

Extension of yaml "include" functionality

Baleen 2.6.0 introduced the ability to include common sets of pipeline entities from a nested YAML file. For example, the following YAML file would include a list of annotators:

annotators:
- include: path/to/my/annotators.yml

It is now also possible to include an entire map section, for example:

include: collectionreader.yml

annotators:
- insert (or include) annotators here

consumers:
- insert (or include) consumers here

where collectionreader.yml contains:

collectionreader:
  class: FolderReader
  folders: input
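
Assuming the include is resolved by merging the included map into the parent file (an inference from the description above, not confirmed behaviour), the resulting pipeline would be equivalent to a single file of the following form:

# Sketch only: the assumed result after collectionreader.yml has been merged
# into the including file. The list entries are the placeholders used above.
collectionreader:
  class: FolderReader
  folders: input

annotators:
- insert (or include) annotators here

consumers:
- insert (or include) consumers here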