Releases: dstl/baleen

Version 2.7.0

08 May 19:56
bb88621

Version 2.7 incorporates a number of changes and improvements, including some breaking changes.

New Functionality

  • Ability to include other YAML files within YAML configuration files
  • Annotator regex.Mgrs now adds GeoJSON to extracted coordinates
  • New annotator regex.NaiveParagraph to naively annotate paragraphs based on multiple new lines
  • New annotator triage.TokenFrequencySummarisation to use a token frequency approach to document summarisation
  • New options on CsvFolderReader collection reader to add line numbers and reprocess files that are modified
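As an illustration of the idea behind regex.NaiveParagraph, the following is a minimal sketch of splitting text into paragraphs wherever two or more consecutive line breaks occur. It is not Baleen's implementation: the class name NaiveParagraphSketch, the method paragraphOffsets and the exact pattern are assumptions made for this example, and the real annotator may treat whitespace and offsets differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NaiveParagraphSketch {

    // Assumption: a run of two or more line breaks separates paragraphs.
    private static final Pattern BREAK = Pattern.compile("(?:\\r?\\n){2,}");

    /** Returns the {begin, end} character offsets of each non-blank paragraph. */
    public static List<int[]> paragraphOffsets(String text) {
        List<int[]> spans = new ArrayList<>();
        Matcher matcher = BREAK.matcher(text);
        int start = 0;
        while (matcher.find()) {
            addIfNotBlank(text, start, matcher.start(), spans);
            start = matcher.end();
        }
        // Whatever follows the last break is the final paragraph.
        addIfNotBlank(text, start, text.length(), spans);
        return spans;
    }

    private static void addIfNotBlank(String text, int begin, int end, List<int[]> spans) {
        if (!text.substring(begin, end).trim().isEmpty()) {
            spans.add(new int[] {begin, end});
        }
    }

    public static void main(String[] args) {
        String text = "First paragraph,\nstill first.\n\nSecond paragraph.\n\n\nThird.";
        for (int[] span : paragraphOffsets(text)) {
            String paragraph = text.substring(span[0], span[1]).replace("\n", " ");
            System.out.println(span[0] + "-" + span[1] + ": " + paragraph);
        }
    }
}
```

A single line break is treated as part of the same paragraph, which is what makes the approach naive: text that separates paragraphs with a single line break or indentation will not be split.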

Updates and Bug Fixes

  • Code quality improvements based on feedback from Codacy
  • Integration with CI tools
  • Set ContentType on Elasticsearch REST requests
  • Support for both Java 8 and newer versions (Java 9+, tested against Java 11)
  • Update dependencies to newer versions
  • Update underlying framework to UimaFIT 3
  • Use synchronous requests in Plankton to avoid race conditions
  • Minor bug fixes, typo corrections, etc.

Breaking Changes

  • Content Extractors are now first-class citizens in Baleen, and as such have their own section in pipeline configuration files. Existing pipeline files will need to be changed, otherwise the content extractor may be incorrectly configured. For more information, see What's New in Baleen 2.7.0.

For a complete list of changes, see the Git commit log.

Version 2.6.0

14 Aug 16:31

This release provides new functionality including new annotators, new consumers, additional functionality for event extraction, relationship extraction and document triage, support for horizontal and vertical scaling and more.

Of note, two new consumers, analysis.mongo and analysis.elasticsearch, have been added, which allow Baleen's output to be exploited within Jonah.

For full details, see What's New in Baleen 2.6.0.

The majority of changes should be backwards compatible with Baleen 2.4.0. However, the Elasticsearch consumer has been upgraded from Elasticsearch 2 to Elasticsearch 5 (tested on 5.6.4). This is likely to be a breaking change and will require Elasticsearch servers to be upgraded, although the ElasticsearchRest consumer should still work with Elasticsearch 2.

Version 2.4.0

05 May 08:25

This release contains a large number of changes, improvements and new features - including new annotators, an updated type system, self ordering pipelines, structure extraction, templating, and a whole lot more!

For full details, see What's New in Baleen 2.4.0. For upgrade instructions, see Upgrading Between Versions.

Version 2.3.0

01 Feb 16:33

The following is a summary of the new features and changes in Baleen 2.3.0. There may be additional changes - refer to the diff and commit log for full details.

Since the previous release, the following changes have been made.

  • New core features
    • Removed the old temporal types (i.e. DateType, DateTime, Time, TimeSpan) and replaced them with a new Temporal type
    • Added Weapon to type system
    • New REST API to enumerate type system
  • New components
    • ActiveMQ support (SharedResource, CollectionReader, Consumer)
    • AddGenderToPerson cleaner
    • AddSourceToMetadata cleaner
    • EntityInitials cleaner
    • SplitBrackets cleaner
  • Improved components
    • Gazetteers now support subtype
    • MoveSource consumer can now move files to a folder based on type
    • Normalisation of Elasticsearch consumers
    • Fix to correctly watch subfolders in FolderReader
  • Bug fixes, improved unit testing, updated dependencies and reductions to technical debt

Please be aware that some aspects of this release may not be backwards compatible with previous versions. Refer to the wiki for information on upgrading between versions.

Version 2.2.0

02 Jun 17:11

The following is a summary of the new features and changes in Baleen
2.2.0. There may be additional changes and features. Please refer to the
diff and commit logs for full details.

New core features

  • All entities now have a sub-type
  • Added gender to Person
  • Baleen Jobs framework
  • Plankton visual pipeline tool

New collection readers and improvements to existing collection readers

  • EmailReader
  • FolderReader now accepts a regular expression to filter against, rather than a file extension
  • MucReader
  • ReutersReader

New annotators and improvements to existing annotators

  • Added nautical miles to Distance regex
  • CorefBrackets cleaner (replaces CorefLocationCoordinate cleaner)
  • Coreference annotators and sieves
  • Improvements to LatLon annotator
  • Interaction annotators
  • Keyword extraction annotators (RakeKeywords and CommonKeywords)
  • Relationship annotators
    • NPVNP
    • SimpleInteraction
    • UbmreConstituent
    • UbmreDependency
  • Rewrite of MoneyRegex to fix issues with previous version
  • USTelephone

New consumers and improvements to existing consumers

  • CSV Consumers
  • Elasticsearch upgraded to Elasticsearch 2
  • ElasticsearchRest
  • MongoPatternSaver
  • Print consumers to output information to the console

New jobs

  • Interactions jobs
  • MongoStats

New resources

  • SharedStopwordResource
  • SharedWordNetResource

Bug fixes, improved unit testing, updated dependencies and reductions to
technical debt

Please be aware that some aspects of this release may not be backwards
compatible with previous versions.

Version 2.1.0

11 Dec 12:18

This version includes the following improvements:

  • New Annotator: MongoStemming uses a gazetteer and stemming to perform
    a pseudo-fuzzy match and find gazetteer terms in different tenses and
    plurals
  • New Cleaner: MergeAdjacent will merge adjacent entities of the same
    type
  • New Content Extractor: CsvContentExtractor splits CSV fields into
    content and metadata
  • New Collection Reader: LineReader will read a single file into
    multiple documents by line
  • New REST API to get configuration parameters for components (e.g.
    annotators)
  • Significant changes to the way gazetteer annotators work, including
    changing from RadixTrees to MultiMaps and implementing the Aho-Corasick
    algorithm, resulting in performance improvements for large gazetteers in
    the order of 100s
  • Lots of bug fixes and minor improvements
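
To illustrate what a cleaner such as MergeAdjacent does, the following is a minimal sketch of merging adjacent entities of the same type. It is not Baleen's implementation: the Span class, the merge method and the rule that "adjacent" means "separated only by whitespace" are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class MergeAdjacentSketch {

    /** Hypothetical stand-in for an entity annotation: a typed character span. */
    public static final class Span {
        public final int begin;
        public final int end;
        public final String type;

        public Span(int begin, int end, String type) {
            this.begin = begin;
            this.end = end;
            this.type = type;
        }
    }

    /**
     * Merges each span into the previous one when both have the same type
     * and only whitespace lies between them in the text.
     */
    public static List<Span> merge(String text, List<Span> spans) {
        List<Span> sorted = new ArrayList<>(spans);
        sorted.sort(Comparator.comparingInt(s -> s.begin));

        List<Span> merged = new ArrayList<>();
        for (Span next : sorted) {
            Span last = merged.isEmpty() ? null : merged.get(merged.size() - 1);
            boolean adjacent = last != null
                    && last.type.equals(next.type)
                    && last.end <= next.begin
                    && text.substring(last.end, next.begin).trim().isEmpty();
            if (adjacent) {
                // Replace the previous span with one covering both.
                merged.set(merged.size() - 1, new Span(last.begin, next.end, last.type));
            } else {
                merged.add(next);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        String text = "New York City is big";
        List<Span> spans = Arrays.asList(
                new Span(0, 3, "Location"),
                new Span(4, 8, "Location"),
                new Span(9, 13, "Location"));
        for (Span span : merge(text, spans)) {
            System.out.println(span.type + ": " + text.substring(span.begin, span.end));
        }
    }
}
```

The sketch only compares each span with the one immediately before it in document order, so overlapping spans, or spans of one type interleaved with another type, are left unmerged.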

Initial Open Source Release

28 Sep 07:53

This is the initial open source release of Baleen, v2.0.0.

For more information, please refer to the README.md.