What's New in Baleen 2.6.0

JohnDaws edited this page Aug 31, 2018 · 6 revisions

Baleen version 2.6.0 was released in August 2018.

The major new functionality provided in Baleen 2.5.0-SNAPSHOT and 2.6.0-SNAPSHOT is outlined below and has been consolidated into the release of version 2.6.0. Note that version 2.5.0 was not formally released and has been skipped in the numbering to reflect the commit history of the project.

Summary of new functionality

Added in Baleen 2.5.0-SNAPSHOT

Baleen 2.5.0-SNAPSHOT contains the following new functionality and components:

  • New components
    • Annotators: CVE (Common Vulnerabilities and Exposures), Epoch Time, IPv6, Lenient URL
    • Collection Readers: CSV Folder, MBOX, SQL Cell, SQL DB Cell, SQL Row
    • Consumers: Elastic-Kibana, Gremlin
  • Upgrades Elasticsearch from version 2 to 5.6.4
    • This is likely to be a breaking change and will require Elasticsearch servers to be upgraded. The ElasticsearchRest consumer may still work with Elasticsearch 2, but this is untested.
  • Improved compatibility with Java 9

Added in Baleen 2.6.0-SNAPSHOT

Baleen 2.6.0-SNAPSHOT contains new functionality which provides:

  • Horizontal and Vertical Scaling

    • A reading pipeline writes the whole jCas to a transport system and other pipelines can then read from that system. These readers can be specified with a multiplicity property. The following transport systems are supported:
      • In Memory, for testing and for in-server transports
      • ActiveMQ (NB all ActiveMQ dependencies have been moved under baleen-activemq)
      • Kafka
      • RabbitMQ
      • Redis
  • Knowledge Representation

    • There are new features for graph representation and an alternative approach to the current Mongo and Elasticsearch consumers, tuned for analysis.
    • Document annotations are represented in the DocumentGraph, while the EntityGraph concentrates on the coreferenced entities (i.e. the Baleen ReferenceTarget)
    • New consumers include:
      • LocationElasticsearch indexes Location annotations per document
      • TemporalElasticsearch indexes Temporal annotations per document
      • MongoRelations gives a separate representation of Relation annotations
      • MongoEvents gives a separate representation of Event annotations
      • analysis.Mongo and analysis.Elasticsearch are alternative representations in Mongo and Elasticsearch for analysis applications
      • baleen-graph allows the representation of the data as a (Tinkerpop) graph that can be output to file or supported graph databases
      • baleen-graph-neo4j adds specific support for neo4j using the bolt protocol
      • baleen-rdf allows the output of the graphs in RDF formats and triplestores.
  • Relation Extraction

    • Distance measures are added to the existing Relation type
    • SentenceRelationshipAnnotator and DocumentRelationshipAnnotator provide baseline relationship content
    • Pattern-based annotators allow regular expression matching based on relations and mixing entities with parts of speech into the regular expression patterns
    • A relationship extraction system based on the ReNoun algorithm is added for noun-based relationship extraction, building on dependency-pattern-based extraction
  • Document Triage

    • Existing document triage annotators are collated into the triage namespace and extended with new annotators:
      • assign document date from title
      • create document summaries
      • compute the Shannon Entropy of the document
    • The Mallet library is integrated for document classification
      • learning from labelled documents
      • learning by suggestion
      • Latent Dirichlet approach
  • Entity Linking

    • A framework for entity linking is implemented to identify entities in a document against an external source.
    • Candidate suppliers from DBpedia and a Mongo document based supplier are provided along with a general matching algorithm using a bag of words approach. Note that tuned implementations may be needed for particular situations.
  • Event Extraction

    • Simple event extraction has been added
    • The Odin library has been integrated for rule based event extraction.
  • Pipeline specification

    • Specifying pipelines from command line
      • A Baleen pipeline YAML file can be run at startup with the command line call java -jar baleen.jar -p myPipeline.yml. This is simpler than passing, as a command line parameter, a config file which in turn references the pipeline.
    • Yaml inclusion
      • It is now possible to include one YAML pipeline configuration file inside another. For example, this simplifies the creation of pipelines which use a common set of annotators but differ in their collection reader and/or consumer. The syntax for this is -include: ./path/to/other/yaml.yml within a YAML pipeline file.
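
As a rough illustration of the Shannon Entropy measure listed under Document Triage above, the following sketch computes the entropy of a piece of text in bits per character. It is not taken from the Baleen source: the class name ShannonEntropySketch and the choice of a character-level (rather than word-level) distribution are assumptions made purely for illustration, and the Baleen annotator may define the measure differently.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper, not part of Baleen: character-level Shannon entropy. */
public class ShannonEntropySketch {

    /** Returns the entropy of the text in bits per character (0.0 for empty or null input). */
    public static double entropy(String text) {
        if (text == null || text.isEmpty()) {
            return 0.0;
        }

        // Count how often each Unicode code point occurs
        Map<Integer, Integer> counts = new HashMap<>();
        int total = 0;
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            counts.merge(codePoint, 1, Integer::sum);
            total++;
            i += Character.charCount(codePoint);
        }

        // H = -sum over symbols of p * log2(p)
        double h = 0.0;
        for (int count : counts.values()) {
            double p = (double) count / total;
            h -= p * (Math.log(p) / Math.log(2));
        }
        return h;
    }

    public static void main(String[] args) {
        // Repetitive text scores lower than text with a more even spread of characters
        System.out.println(entropy("aaaaaaaa"));
        System.out.println(entropy("Baleen 2.6.0 was released in August 2018."));
    }
}
```

A low value suggests highly repetitive content, while a higher value suggests a more varied character distribution, which is the kind of signal a triage step could use to flag unusual documents.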

There are also a number of bug fixes and improvements to previous versions.

See the following pages for further information on the new functionality as well as example configuration and pipeline files:

Many of the examples are drawn with thanks from Committed Software's Baleen Examples project.