Upgrading Between Versions

James Baker edited this page May 8, 2019 · 8 revisions

Upgrading 2.6.0 to 2.7.0

See What's New in Baleen 2.7.0 for a detailed description of the changes in this release.

In particular, review the section on Content Extractors for information on how to upgrade to Baleen 2.7.0.

Upgrading 2.3.0 to 2.4.0

For a full list of changes in Baleen 2.4.0, see What's New in Baleen 2.4.0.

Pipeline Ordering

As of Baleen 2.4.0, annotators and consumers in a pipeline will, by default, attempt to order themselves. This removes the need for pipeline developers to have in-depth knowledge of the various annotators, but a self-ordered pipeline may not perform as well as one configured by an expert.

You can disable this feature by adding the following to the pipeline configuration:

orderer: uk.gov.dstl.baleen.core.pipelines.orderers.NoOpOrderer

Changes to Annotators

All annotators are now required to implement the getAction() method. The core annotators (i.e. the ones that come bundled with Baleen) have already been updated. If you use third-party annotators, you will need to upgrade them to versions that are compatible with Baleen 2.4.

Changes to Jobs

Job configuration files no longer require the top-level job block. What was previously:

job:
  schedule: Once
  tasks:
  - MongoStats

Should now be written as:

schedule: Once
tasks:
- MongoStats
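
For existing job files, the change amounts to lifting the contents of the job block to the top level. The following is a minimal Python sketch of that transformation, assuming the configuration has already been parsed into a dictionary (for example with a YAML library). The function name unwrap_job_block is hypothetical and is not part of Baleen.

```python
def unwrap_job_block(config):
    """Return a copy of a parsed job configuration without the top-level 'job' block.

    Hypothetical helper for illustration only; it is not part of Baleen.
    """
    job = config.get("job")
    if not isinstance(job, dict):
        # Already in the 2.4.0 layout: return an unchanged shallow copy.
        return dict(config)
    # Keep any other top-level keys, then lift the contents of 'job' alongside them.
    flattened = {key: value for key, value in config.items() if key != "job"}
    flattened.update(job)
    return flattened


old_style = {"job": {"schedule": "Once", "tasks": ["MongoStats"]}}
print(unwrap_job_block(old_style))
# prints {'schedule': 'Once', 'tasks': ['MongoStats']}
```

A configuration that is already in the new layout passes through unchanged, so the sketch can be run over a mixed set of files.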

Upgrading 2.2.0 to 2.3.0

Changes to Type System

Baleen 2.3.0 introduces significant changes to the type system, in particular to its temporal types. The following classes have been replaced with a new Temporal type:

  • DateTime
  • DateType
  • Time
  • TimeSpan

This affects both the output of Baleen (and may therefore affect downstream tools that use Baleen) and the required configuration. The following annotators must be removed from existing configurations and, where indicated, replaced with new annotators:

  • cleaners.AddTimeSpans - no longer required as temporal entities inherently support spans
  • cleaners.CleanDates - replace with cleaners.CleanTemporal
  • cleaners.NormalizeDates - replace with cleaners.NormalizeTemporal
  • cleaners.NormalizeTimes - replace with cleaners.NormalizeTemporal
  • cleaners.RemoveNestedDateTimes - no longer required as cleaners.RemoveNestedEntities will now work correctly with temporal entities
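
The replacements listed above can be applied mechanically to an annotator list. The following is a rough Python sketch, assuming the annotators have been read from the pipeline configuration into a list and that the affected entries are plain strings. The names REPLACEMENTS and migrate_annotators, and the example.OtherAnnotator entry, are hypothetical and are not part of Baleen.

```python
# Mapping taken from the list above: None means "remove, no replacement needed".
REPLACEMENTS = {
    "cleaners.AddTimeSpans": None,
    "cleaners.CleanDates": "cleaners.CleanTemporal",
    "cleaners.NormalizeDates": "cleaners.NormalizeTemporal",
    "cleaners.NormalizeTimes": "cleaners.NormalizeTemporal",
    "cleaners.RemoveNestedDateTimes": None,
}


def migrate_annotators(annotators):
    """Return a new annotator list with pre-2.3.0 temporal annotators replaced."""
    migrated = []
    for entry in annotators:
        if not isinstance(entry, str) or entry not in REPLACEMENTS:
            # Unaffected annotators (including non-string entries) are kept as-is.
            migrated.append(entry)
            continue
        replacement = REPLACEMENTS[entry]
        # Drop removed annotators, and avoid adding the same replacement twice
        # (NormalizeDates and NormalizeTimes both map to NormalizeTemporal).
        if replacement is not None and replacement not in migrated:
            migrated.append(replacement)
    return migrated


old_annotators = [
    "example.OtherAnnotator",  # hypothetical placeholder for an unaffected annotator
    "cleaners.AddTimeSpans",
    "cleaners.CleanDates",
    "cleaners.NormalizeDates",
    "cleaners.NormalizeTimes",
    "cleaners.RemoveNestedDateTimes",
]
print(migrate_annotators(old_annotators))
# prints ['example.OtherAnnotator', 'cleaners.CleanTemporal', 'cleaners.NormalizeTemporal']
```

Entries that carry their own parameters (i.e. that are not plain strings) are passed through untouched here and would need reviewing by hand.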

There are also new annotators that improve the extraction of temporal entities - see the list of new annotators below.

The Temporal type has the following properties:

  • precision - EXACT, RELATIVE or UNQUALIFIED depending on the known precision of the temporal instance
  • scope - SINGLE or RANGE depending on whether the entity represents a single temporal instance (e.g. 2nd Feb 2017), or a range of temporal instances (e.g. 2-12 Feb 2017)
  • temporalType - DATE, TIME or DATETIME depending on the type of the temporal instance
  • timestampStart - the Unix timestamp (inclusive) in seconds of the start of the temporal period being represented
  • timestampStop - the Unix timestamp (exclusive) in seconds of the end of the temporal period being represented
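
To illustrate the inclusive start and exclusive stop, the following Python sketch computes the two timestamps for the example dates above. It is an illustration only, not Baleen's implementation: it assumes that dates are interpreted in UTC and that a date covers the whole calendar day, neither of which is stated here. The function names are hypothetical.

```python
from datetime import date, datetime, timedelta, timezone
from typing import Tuple


def day_range_timestamps(first_day: date, last_day: date) -> Tuple[int, int]:
    """Return (start, stop) in Unix seconds for the days first_day..last_day.

    start is inclusive (midnight at the beginning of first_day) and stop is
    exclusive (midnight at the end of last_day). UTC is an assumption.
    """
    start = datetime(first_day.year, first_day.month, first_day.day, tzinfo=timezone.utc)
    stop = datetime(last_day.year, last_day.month, last_day.day, tzinfo=timezone.utc)
    stop += timedelta(days=1)
    return int(start.timestamp()), int(stop.timestamp())


def contains(start: int, stop: int, timestamp: int) -> bool:
    """Check whether a Unix timestamp falls in the half-open interval [start, stop)."""
    return start <= timestamp < stop


# SINGLE scope: 2nd Feb 2017
single = day_range_timestamps(date(2017, 2, 2), date(2017, 2, 2))
print(single)  # prints (1485993600, 1486080000)

# RANGE scope: 2-12 Feb 2017
span = day_range_timestamps(date(2017, 2, 2), date(2017, 2, 12))
print(span)  # prints (1485993600, 1486944000)

# The stop value itself is outside the period, because it is exclusive.
print(contains(*single, single[1]))  # prints False
```

Downstream tools that filter on these fields would typically use the same half-open comparison, so that adjacent periods do not overlap.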

In addition, a new Weapon type has been added to the type system.

Removed components

In addition to the changes detailed above, the following components have been removed:

  • LegacyMongo (Consumer)

New and improved components

The following components have now been added to the standard Baleen build, and you may wish to include these in your configuration. For more information, view the relevant Javadoc.

  • ActiveMQReader (Collection Reader) - read documents from an ActiveMQ topic
  • ActiveMQ (Consumer) - publish outputs onto an ActiveMQ topic
  • cleaners.AddGenderToPerson - add gender information to Person entities
  • cleaners.EntityInitials - identify initials following an entity and associate these initials with the entity (including other occurrences)
  • cleaners.SplitBrackets - identify entities that include brackets and split the brackets into a separate coreferenced entity
  • misc.AddSourceToMetadata - add source information to the document as a Metadata annotation
  • regex.RelativeDate - identify temporal entities such as 'last Thursday', and resolve them where possible
  • regex.UnqualifiedDate - identify incomplete dates that can't be explicitly resolved (e.g. 2nd February)

The following components have been improved, and may now have additional functionality that you wish to use.

  • All gazetteers now support the subtype parameter, allowing you to assign subtype information to any entity from a gazetteer
  • MoveSource (Consumer) - source files can now be optionally moved to folders based on the document type