
Knowledge Representation

JohnDaws edited this page Sep 5, 2018 · 5 revisions

A number of new consumers have been developed for Baleen 2.6.

Analysis consumers

The analysis.elasticsearch and analysis.mongo consumers have been created to allow simplified analytical queries over Baleen output in the respective databases. These consumers are also designed for use with the Jonah visualisation tool.

These databases can be run in Docker with the following commands:

docker run -d -p 27017:27017 mongo:3
docker run -d -p 9200:9200 -p 9300:9300 -e "http.host=0.0.0.0" -e "transport.host=0.0.0.0" -e "xpack.security.enabled=false" -e "discovery.type=single-node" -e "ES_JAVA_OPTS=-Xms750m -Xmx750m" docker.elastic.co/elasticsearch/elasticsearch:5.6.8 

Elasticsearch consumers

In addition to the existing Elasticsearch consumers, new consumers have been created for geographic and temporal searching, namely LocationElasticsearch and TemporalElasticsearch.

LocationElasticsearch creates an index of documentId to the Elasticsearch types geo_point and geo_shape. This allows for quick and scalable geo queries and aggregations such as geohash heatmaps.
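
As a rough illustration of the kind of request such an index supports, the sketch below builds an Elasticsearch search body that combines a geo_bounding_box filter with a geohash_grid aggregation. The field name `location` and the function name are assumptions made for this example; the mapping created by LocationElasticsearch should be checked for the actual field names.

```python
import json


def build_heatmap_query(top_left, bottom_right, field="location", precision=5):
    """Build a search body: bounding-box filter plus geohash grid buckets.

    `field` is a hypothetical geo_point field name, not taken from Baleen.
    Corners are given as (lat, lon) tuples.
    """
    return {
        "size": 0,  # only the aggregation buckets are needed, not the hits
        "query": {
            "bool": {
                "filter": {
                    "geo_bounding_box": {
                        field: {
                            "top_left": {"lat": top_left[0], "lon": top_left[1]},
                            "bottom_right": {"lat": bottom_right[0], "lon": bottom_right[1]},
                        }
                    }
                }
            }
        },
        "aggs": {
            "heatmap": {"geohash_grid": {"field": field, "precision": precision}}
        },
    }


# Print the body so it can be pasted into a search request.
print(json.dumps(build_heatmap_query((52.0, -1.0), (51.0, 0.5)), indent=2))
```

Each bucket returned by the aggregation is a geohash cell with a document count, which is the input a heatmap layer would typically need.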

TemporalElasticsearch creates an index of documentId to all Temporal mentions. This facilitates quick responses to queries about time and date ranges, as well as aggregations for timelines and date histograms. It uses the date type for ‘single’ precision mentions and the date_range type for ‘range’ precision mentions. Relative time mentions are not included, as they do not have a fixed point in time to reference.
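
A timeline query over such an index might look like the following sketch, which pairs a range query with a date_histogram aggregation. The field name `date` and the function name are assumptions for illustration. The `interval` parameter matches the Elasticsearch 5.x image used above; later Elasticsearch versions name this parameter differently.

```python
import json


def build_timeline_query(start, end, field="date", interval="month"):
    """Build a search body: date range filter plus a date histogram.

    `field` is a hypothetical date field name, not taken from Baleen.
    `start` and `end` are ISO 8601 date strings.
    """
    return {
        "size": 0,  # only the histogram buckets are needed
        "query": {"range": {field: {"gte": start, "lte": end}}},
        "aggs": {
            "timeline": {
                # "interval" is the Elasticsearch 5.x parameter name
                "date_histogram": {"field": field, "interval": interval}
            }
        },
    }


print(json.dumps(build_timeline_query("2018-01-01", "2018-12-31"), indent=2))
```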

Note that the meta-time (the time a document is authored or published) is included in the main Elasticsearch document as part of the metadata.

Graphs

A significant new development for Baleen 2.6 is the ability to output two different graph structures for entities that are linked by coreference, relationships or events. The first, Document Graph, represents the annotations in the document and faithfully stores all the relevant information from the Content Information level. From this graph, any reasonable graph representation can be derived through filtering, aggregation or path short-cutting. The second, Entity Graph, is a derived graph representing the higher level entity and relation information. In this case the reference target nodes from the document are mapped to entity nodes using a configurable mapping of attributes from the associated mentions.
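
The mapping from reference targets to entity nodes can be pictured with the sketch below. It is not Baleen's implementation: the dictionary layout, the `derive_entities` function and the `most_common` reducer are all hypothetical, and simply show how a configurable attribute mapping could collapse several mentions into one entity node.

```python
from collections import Counter, defaultdict


def most_common(values):
    """Reducer: pick the most frequent value, or None if there are none."""
    return Counter(values).most_common(1)[0][0] if values else None


def derive_entities(mentions, attribute_mapping):
    """Group mentions by reference target and reduce each mapped attribute."""
    grouped = defaultdict(list)
    for mention in mentions:
        grouped[mention["target"]].append(mention)

    entities = {}
    for target, group in grouped.items():
        entities[target] = {
            attr: reduce_fn([m[attr] for m in group if attr in m])
            for attr, reduce_fn in attribute_mapping.items()
        }
    return entities


# Hypothetical mentions; "target" stands in for the reference target node id.
mentions = [
    {"target": "rt1", "type": "Person", "value": "John Smith"},
    {"target": "rt1", "type": "Person", "value": "Smith"},
    {"target": "rt1", "type": "Person", "value": "John Smith"},
    {"target": "rt2", "type": "Location", "value": "London"},
]
mapping = {"type": most_common, "value": most_common}
entities = derive_entities(mentions, mapping)
print(entities)
```

Here the three coreferent mentions of one person become a single entity node whose value is the most frequent mention text. Other reducers, such as the longest value or the set of all values, could be swapped in through the mapping.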

The following graph-based consumers have been implemented:

  • print.documentGraph and print.entityGraph - to log the graph output as GraphML or JSON
  • file.documentGraph and file.entityGraph - to write to a file in GraphML, JSON or the Kryo binary format
  • Neo4JDocumentGraphConsumer and Neo4JEntityGraphConsumer - to write to Neo4j using the Bolt protocol (https://boltprotocol.org/).
  • DocumentGraphConsumer and EntityGraphConsumer - to write to TinkerPop-supported graph databases. See https://github.com/mohataher/awesome-tinkerpop for a list of supported graph databases. The required graph driver may need to be added to the classpath.

See Baleen graph and RDF examples for a more detailed description and examples.

Resource Description Framework (RDF)

Baleen's output can be written as RDF using a simple OWL schema based on the Document and Entity graph structures defined above. This is done with the file.Rdf or file.RdfEntityGraph consumers, as follows:

consumers:
- class: file.Rdf
  outputDirectory: ./output_rdf
  format: RDF_XML
- class: file.RdfEntityGraph
  outputDirectory: ./output_entity_rdf
  format: RDF_XML

Where the supported formats are:

  • RDF_XML - Standard RDF XML serialisation
  • TURTLE - Terse RDF Triple Language. Output is similar in form to SPARQL
  • RDF_XML_ABBREV - Abbreviated RDF XML serialisation
  • N_TRIPLES - Each line is a triple in the form "Subject Predicate Object ."
  • RDF_JSON - A JSON representation of the RDF, see https://jena.apache.org/documentation/io/rdf-json.html
  • JSONLD - JSON for Linked Data, see https://json-ld.org/
  • N3 - Notation3, a human-readable triple format.

Alternatively, Baleen can output to external triple stores that support the SPARQL Graph Store protocol. For example, Baleen can output to Fuseki (running locally with a dataset named 'baleen') using:

consumers:
- class: rdf.RdfDocumentGraphConsumer
  query: http://localhost:3030/baleen/query
  update: http://localhost:3030/baleen/update
  store: http://localhost:3030/baleen/data
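
Once a pipeline has run, one way to check that triples have arrived is to send a SPARQL query to the query endpoint configured above. The sketch below is a minimal example using only the Python standard library. The function names are hypothetical, and it assumes the local Fuseki instance from the configuration accepts unauthenticated requests.

```python
import json
import urllib.parse
import urllib.request

# Query endpoint taken from the consumer configuration above.
QUERY_ENDPOINT = "http://localhost:3030/baleen/query"


def build_count_request(endpoint=QUERY_ENDPOINT):
    """Build (but do not send) a SPARQL request counting all triples."""
    sparql = "SELECT (COUNT(*) AS ?triples) WHERE { ?s ?p ?o }"
    data = urllib.parse.urlencode({"query": sparql}).encode("utf-8")
    # Supplying a body makes urllib issue a POST with URL-encoded form data.
    return urllib.request.Request(
        endpoint,
        data=data,
        headers={"Accept": "application/sparql-results+json"},
    )


def count_triples(endpoint=QUERY_ENDPOINT):
    """Send the request and return the triple count (needs a running server)."""
    with urllib.request.urlopen(build_count_request(endpoint)) as response:
        results = json.load(response)
    return int(results["results"]["bindings"][0]["triples"]["value"])
```

Calling `count_triples()` before and after running a pipeline gives a quick indication of whether the consumer wrote anything to the store.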

See Baleen graph and RDF examples for a more detailed example of output to Fuseki.