Graph and RDF Examples

JohnDaws edited this page Aug 13, 2018 · 1 revision

This page gives brief examples of setting up consumers for the various graph and RDF output options in Baleen. To generate a graph, the pipeline must include annotators that extract relations, co-references or events.

These examples use Docker to host the external dependencies. Note that on Windows the Docker machines may not run on localhost, so references to localhost in this document and in the pipeline may need to be replaced with the Docker machine IP address.

File based output

This example writes the document and entity graphs to file as GraphML, and the corresponding RDF in RDF_XML format.

consumers:
- class: file.DocumentGraph
  outputDirectory: ./output_document_graph
  format: GRAPHML
- class: file.EntityGraph
  outputDirectory: ./output_entity_graph
  format: GRAPHML
- class: file.Rdf
  outputDirectory: ./output_rdf
  format: RDF_XML
- class: file.RdfEntityGraph
  outputDirectory: ./output_entity_rdf
  format: RDF_XML

Alternative formats for the graph outputs are:

  • GRAPHML - XML-based format
  • GRAPHSON - JSON-based format
  • GYRO - Kryo-based binary format (Apache TinkerPop calls this format Gryo)
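
As a quick check on the GRAPHML output, the files can be read with any XML parser. The following is a minimal Python sketch using only the standard library. The sample document, its node ids and its `type` key are invented for illustration and are not Baleen's actual property names; only the GraphML namespace is standard.

```python
import xml.etree.ElementTree as ET

# GraphML elements live in this XML namespace.
GRAPHML_NS = {"g": "http://graphml.graphdrawing.org/xmlns"}

# Hypothetical sample standing in for a file from ./output_entity_graph;
# the ids, keys and values are made up for this example.
SAMPLE = """<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
  <key id="type" for="node" attr.name="type" attr.type="string"/>
  <graph id="G" edgedefault="directed">
    <node id="1"><data key="type">Person</data></node>
    <node id="2"><data key="type">Location</data></node>
    <edge id="3" source="1" target="2"/>
  </graph>
</graphml>
"""


def summarise_graphml(xml_text):
    """Return (nodes, edges): nodes maps node id -> {key: value},
    edges is a list of (source id, target id) pairs."""
    root = ET.fromstring(xml_text)
    nodes = {}
    for node in root.iterfind(".//g:node", GRAPHML_NS):
        nodes[node.get("id")] = {
            d.get("key"): d.text for d in node.iterfind("g:data", GRAPHML_NS)
        }
    edges = [
        (e.get("source"), e.get("target"))
        for e in root.iterfind(".//g:edge", GRAPHML_NS)
    ]
    return nodes, edges


nodes, edges = summarise_graphml(SAMPLE)
print(len(nodes), "nodes,", len(edges), "edges")
```

To inspect real output, read one of the files written to the output directory and pass its text to `summarise_graphml` in place of `SAMPLE`.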

Alternative formats for RDF output are:

  • RDF_XML - Standard RDF XML serialisation
  • TURTLE - Terse RDF Triple Language. Output is similar in form to SPARQL
  • RDF_XML_ABBREV - Abbreviated RDF XML serialisation
  • N_TRIPLES - Each line is a triple in the form "Subject Predicate Object ."
  • RDF_JSON - A JSON representation of the RDF, see https://jena.apache.org/documentation/io/rdf-json.html
  • JSONLD - JSON for Linked Data, see https://json-ld.org/
  • N3 - Notation3, a human-readable triple format.
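
Because N_TRIPLES puts one triple on each line, it is the easiest of these formats to process without an RDF library. The sketch below splits such lines into subject, predicate and object. It is a simplification and not a full N-Triples parser, and the example.org IRIs and the literal are invented rather than taken from real Baleen output.

```python
import re

# Subject and predicate contain no whitespace; the object is everything up to
# the terminating " ." and may be an IRI, a blank node or a quoted literal.
TRIPLE = re.compile(r"^(\S+)\s+(\S+)\s+(.+?)\s*\.\s*$")

# Hypothetical sample data; a real file would come from ./output_rdf.
SAMPLE = """
<http://example.org/entity/1> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.org/Person> .
<http://example.org/entity/1> <http://example.org/value> "John Smith" .
"""


def parse_ntriples(text):
    """Return a list of (subject, predicate, object) string tuples."""
    triples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        match = TRIPLE.match(line)
        if match is None:
            raise ValueError(f"not a triple: {line!r}")
        triples.append(match.groups())
    return triples


triples = parse_ntriples(SAMPLE)
for subject, predicate, obj in triples:
    print(subject, predicate, obj)
```

For anything beyond a quick look, a dedicated RDF library such as Apache Jena is a more robust choice, since it handles escaping, datatypes and language tags.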

Neo4j

This example outputs the graph representation to the Neo4j graph database. You can run the service in Docker with the following command:

docker run -d -p 7474:7474 -p 7687:7687 neo4j:3.0

You must set a password for the root user neo4j in the UI at http://localhost:7474. The example assumes you set it to neopass, but this can be changed in the configuration below. Run the example with the following consumer configuration:

consumers:
#- class: graph.Neo4JDocumentGraphConsumer
- class: graph.Neo4JEntityGraphConsumer
  #closeAfterEveryDocument: true
  #url: bolt://localhost:7687
  #username: neo4j
  password: neopass
  filterFeatures:
   - isNormalised
  valueStrategy: 
   - gender
   - Mode
   - geoJson
   - Mode
   - type
   - Mode
   - relationshipType
   - Mode

Entity graph output to OrientDB

This demonstrates the ability to output the higher-level entity graph to a graph database using the Apache TinkerPop abstraction layer on top of OrientDB. This currently only works with the OrientDB version 3 release candidate, so these instructions may soon become obsolete. To run OrientDB 3.0.0RC1 in Docker use:

docker run -d -p 2424:2424 -p 2480:2480 -e ORIENTDB_ROOT_PASSWORD=rootpwd orientdb:3.0.0RC1

You must create a database named baleen from the user interface at http://localhost:2480.

Note that this example requires the graph drivers to be on the classpath. If running from the code, this can be done by adding a Maven dependency on

<dependency>
	<groupId>com.orientechnologies</groupId>
	<artifactId>orientdb-gremlin</artifactId>
	<version>3.0.0RC1</version>
</dependency>

and, for convenience, these are included but commented out in baleen-graph/pom.xml.

Running from the command line you must download the jars from:

and include them on the classpath (for example in a folder named "orient").

Baleen can then be run with:

java -cp "baleen.jar:orient/*"  uk.gov.dstl.baleen.runner.Baleen

or on Windows:

java -cp "baleen.jar;orient/*"  uk.gov.dstl.baleen.runner.Baleen

(See Using-Third-Party-Components for more information on running Baleen with third party jars.)
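
The two commands above differ only in the classpath separator (":" on Linux and macOS, ";" on Windows). If Baleen is launched from a script, a small wrapper can pick the separator for the current platform. This is a sketch only: the function name `baleen_command` and its parameters are hypothetical, while the jar name, folder name and main class are the ones used above.

```python
import os


def baleen_command(jar="baleen.jar", lib_dir="orient", pathsep=os.pathsep):
    """Build the argument list for launching Baleen with extra jars.

    os.pathsep is ":" on POSIX systems and ";" on Windows, which matches
    the separator that java expects in -cp on each platform.
    """
    classpath = pathsep.join([jar, f"{lib_dir}/*"])
    return ["java", "-cp", classpath, "uk.gov.dstl.baleen.runner.Baleen"]


print(" ".join(baleen_command()))
```

The returned list can be passed straight to `subprocess.run`. No shell quoting is needed in that case, because the `*` wildcard in the classpath is expanded by Java itself and not by the shell.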

The OrientDB consumer is added to the pipeline file as follows:

consumers:
- class: graph.EntityGraphConsumer
  graphConfig: ./graph/orient.properties 

or

consumers:
- class: graph.DocumentGraphConsumer
  graphConfig: ./graph/orient.properties 

where ./graph/orient.properties is a text file containing:

gremlin.graph=org.apache.tinkerpop.gremlin.orientdb.OrientGraph
orient-url=remote:localhost/baleen
orient-user=root
orient-pass=rootpwd

Note that on Windows "localhost" may need to be replaced with the Docker machine IP address.

RDF output to file

Baleen's data can be written as RDF using a simple OWL schema based on the Document and Entity graph structures described above, using the file.Rdf or file.RdfEntityGraph consumers as follows.

consumers:
- class: file.Rdf
  outputDirectory: ./output_rdf
  format: RDF_XML
- class: file.RdfEntityGraph
  outputDirectory: ./output_entity_rdf
  format: RDF_XML

Where the supported formats are:

  • RDF_XML - Standard RDF XML serialisation
  • TURTLE - Terse RDF Triple Language. Output is similar in form to SPARQL
  • RDF_XML_ABBREV - Abbreviated RDF XML serialisation
  • N_TRIPLES - Each line is a triple in the form "Subject Predicate Object ."
  • RDF_JSON - A JSON representation of the RDF, see https://jena.apache.org/documentation/io/rdf-json.html
  • JSONLD - JSON for Linked Data, see https://json-ld.org/
  • N3 - Notation3, a human-readable triple format.

RDF output to Apache Fuseki

This demonstrates the ability to represent the extracted information as RDF and store it in a triple store. To run this example you must have an instance of Fuseki running with the admin password pw123, and you must create datasets named baleen_entity and baleen_document through the user interface, which can be accessed at localhost:3030 with the credentials admin:pw123. You can run Fuseki in Docker with:

docker run -d -p 3030:3030 -e ADMIN_PASSWORD=pw123 stain/jena-fuseki:3.6.0

The consumers are then added to the pipeline as follows:

consumers:
- class: rdf.RdfEntityGraphConsumer
  query: http://localhost:3030/baleen_entity/query
  update: http://localhost:3030/baleen_entity/update
  store: http://localhost:3030/baleen_entity/data
  filterFeatures:
   - isNormalised
- class: rdf.RdfDocumentGraphConsumer
  query: http://localhost:3030/baleen_document/query
  update: http://localhost:3030/baleen_document/update
  store: http://localhost:3030/baleen_document/data
  filterFeatures:
   - isNormalised
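
Once a pipeline has run, one way to confirm that triples reached Fuseki is to send a SPARQL query to the dataset's query endpoint. The sketch below counts the triples in a dataset using only the Python standard library. It assumes the query endpoint above is reachable without authentication, which may not hold for every Fuseki setup, and the function names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Counts every triple in the default graph of the dataset.
COUNT_QUERY = "SELECT (COUNT(*) AS ?triples) WHERE { ?s ?p ?o }"


def build_query_request(endpoint, sparql=COUNT_QUERY):
    """Build a SPARQL protocol GET request that asks for JSON results."""
    url = endpoint + "?" + urlencode({"query": sparql})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


def count_triples(endpoint):
    """Send the count query and return the number of triples as an int."""
    with urlopen(build_query_request(endpoint)) as response:
        body = json.load(response)
    return int(body["results"]["bindings"][0]["triples"]["value"])


# Example, with Fuseki running as described above:
#   count_triples("http://localhost:3030/baleen_entity/query")
```

A count of zero after processing documents suggests that no relations, co-references or events were extracted, or that the store and update URLs point at a different dataset.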