Relation Extraction

James Baker edited this page Oct 24, 2019 · 4 revisions

Additional relationship extraction functionality has been added to Baleen 2.6. This includes some simple high-recall, low-precision relationship extraction annotators based on the co-occurrence of entities in sentences or documents, a number of pattern-based algorithms, and a more complex annotator based on the ReNoun paper.

Simple Relationship Extraction

Two new annotators have been added which assign a relationship between entities based on co-occurrence in a sentence or document.

  • DocumentRelationshipAnnotator assigns a relationship to entities which occur in the same document but in different sentences, within a configurable sentence distance. It may also be configured to restrict relationships to specific types. It adds a "sentence distance" to the annotation, which may later be used to assign a confidence to the relation.
  • SentenceRelationshipAnnotator assigns relationships to entities of configurable types which appear in the same sentence. These relations have a sentence distance of 0, but may also have values set by the word distance (number of words between entities) or the dependency distance (length of the shortest path between the two entities in the dependency graph).
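
The three distance measures described above can be illustrated with a small sketch. This is not Baleen's implementation: the function names, the token indices and the edge list are hypothetical, and the dependency graph is assumed to be undirected.

```python
from collections import deque


def sentence_distance(first_sentence, second_sentence):
    """Number of sentence boundaries between two entities (0 = same sentence)."""
    return abs(first_sentence - second_sentence)


def word_distance(first_end, second_begin):
    """Number of tokens strictly between two entities.

    first_end is the index of the last token of the earlier entity and
    second_begin the index of the first token of the later entity.
    """
    return max(0, second_begin - first_end - 1)


def dependency_distance(edges, source, target):
    """Length of the shortest path between two tokens in a dependency graph.

    edges is a list of (head, dependent) token-index pairs, treated as
    undirected. Returns None if the two tokens are not connected.
    """
    neighbours = {}
    for head, dependent in edges:
        neighbours.setdefault(head, set()).add(dependent)
        neighbours.setdefault(dependent, set()).add(head)

    queue = deque([(source, 0)])
    seen = {source}
    while queue:
        node, depth = queue.popleft()
        if node == target:
            return depth
        for nxt in neighbours.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return None


# Tokens: Alice(0) visited(1) the(2) museum(3) in(4) Paris(5)
# Hypothetical parse: visited -> Alice, museum; museum -> the, in; in -> Paris
edges = [(1, 0), (1, 3), (3, 2), (3, 4), (4, 5)]

print(word_distance(0, 5))               # tokens between "Alice" and "Paris"
print(dependency_distance(edges, 0, 5))  # path Alice-visited-museum-in-Paris
```

Any of these values could later be mapped to a confidence, for example by giving lower confidence to relations with a larger distance.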

Simple Relationship Example

A specialised MongoRelations consumer has been developed to quickly analyse the relationships derived from sentence and document co-occurrence. It may be used as in the simple example pipeline below, which uses OpenNLP to detect people and locations before assigning relationships and outputting to MongoRelations for analysis.

mongo:
  db: baleen_simple_relations
  host: localhost
  

collectionreader:
  class: FolderReader
  folders: ..\corpora\re3d

annotators:
  - language.OpenNLP
  - class: stats.OpenNLP
    model: ..\models\en-ner-person.bin
    type: uk.gov.dstl.baleen.types.common.Person
  - class: stats.OpenNLP
    model: ..\models\en-ner-location.bin
    type: uk.gov.dstl.baleen.types.semantic.Location
  - relations.DocumentRelationshipAnnotator
  - relations.SentenceRelationshipAnnotator

consumers:
  - MongoRelations

Pattern Based Relationship Extraction

The existing NPVNP and SimpleInteraction relationship annotators have been supplemented with three new relationship annotators.

  • DependencyRelationshipAnnotator is a more restricted version of the SentenceRelationshipAnnotator which only assigns a relationship when there is a dependency path between the two entities in the sentence.
  • RegExRelationAnnotator captures simple cases using regular expressions applied to words between entities. For example ( :Person: )\\s+(?:visit\\w*|went to)\\s+( :Location: ) would create a relationship between a person and a location that they visit(ed) or went to.
  • PartOfSpeechRelationshipAnnotator uses parts of speech in a customised regular expression system. For example, ( NNP ).*( VBD ).* ( NNP ) will extract a proper noun followed by a past tense verb followed by another proper noun, with any text in between.

See the relevant Javadoc for further information including the list of extended Penn Treebank tags that may be used with PartOfSpeechRelationshipAnnotator.
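
The idea behind the RegExRelationAnnotator example can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Baleen's code: entities are represented as hypothetical (begin, end, type) character spans, and only the part of the example pattern that sits between the two entity placeholders is applied to the text between a Person and a following Location.

```python
import re

# Middle part of the example pattern: the text between a Person and a Location.
BETWEEN = re.compile(r"\s+(?:visit\w*|went to)\s+")


def span(text, word, entity_type):
    """Hypothetical stand-in for an entity annotation: (begin, end, type)."""
    begin = text.index(word)
    return (begin, begin + len(word), entity_type)


def extract_relations(text, entities):
    """Return (person, location) pairs whose connecting text matches BETWEEN."""
    relations = []
    for p_begin, p_end, p_type in entities:
        if p_type != "Person":
            continue
        for l_begin, l_end, l_type in entities:
            # Only consider locations that start after the person ends.
            if l_type != "Location" or l_begin < p_end:
                continue
            if BETWEEN.fullmatch(text[p_end:l_begin]):
                relations.append((text[p_begin:p_end], text[l_begin:l_end]))
    return relations


text = "Alice visited Paris. Bob went to London. Carol lives in Rome."
entities = [
    span(text, "Alice", "Person"), span(text, "Paris", "Location"),
    span(text, "Bob", "Person"), span(text, "London", "Location"),
    span(text, "Carol", "Person"), span(text, "Rome", "Location"),
]
print(extract_relations(text, entities))
```

Carol and Rome are not related here because "lives in" is not covered by the pattern, which shows why such patterns tend to be precise but need tuning to the corpus.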

Note that these relationship extraction annotators require dependency parsing to function. For example, the pipeline below will generate relationships between people and/or locations which depend on each other within a sentence.

mongo:
  db: baleen_dependency_relations
  host: localhost
  

collectionreader:
  class: FolderReader
  folders: ..\corpora\re3d

annotators:
  - language.OpenNLP
  - language.MaltParser
  - class: stats.OpenNLP
    model: ..\models\en-ner-person.bin
    type: uk.gov.dstl.baleen.types.common.Person
  - class: stats.OpenNLP
    model: ..\models\en-ner-location.bin
    type: uk.gov.dstl.baleen.types.semantic.Location
  - relations.DependencyRelationshipAnnotator

consumers:
  - MongoRelations

ReNoun Based Fact Extraction

Baleen 2.6 contains an implementation of a system based on the ReNoun paper: https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/42849.pdf

The aim of the ReNoun system is to extract facts in the form (subject, attribute, object) where the attribute is expressed in noun form. For example,

  • (NPR, legal affairs correspondent, Nina Totenberg)
  • (Princeton, economist, Paul Krugman)
  • (Google, CEO, Larry Page)

In Baleen these are expressed as relations of the form (subject, value, target).

The system produces facts in four stages:

  • Seed Fact Extraction - this takes a list of valid attribute types and uses ‘handcrafted’ dependency patterns to extract a set of ‘known facts’ for bootstrapping.
  • Extract Pattern Learning - uses the ‘known facts’ to extract further patterns that would also extract the known facts from the corpus.
  • Fact Extraction - uses the learned patterns to extract more facts from the corpus.
  • Fact Scoring - assigns a measure of confidence in the extracted fact based on a scoring of the patterns used.

Each of these stages can be run as its own Baleen pipeline or job, using Mongo to facilitate communication of data between each stage.
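
The bootstrapping idea behind the first three stages can be sketched with a toy example. This is a deliberate simplification and not how Baleen or ReNoun works internally: surface-string regular expressions stand in for dependency patterns, the corpus sentences are invented, all function names are hypothetical, and fact scoring is omitted.

```python
import re


def learn_patterns(sentences, seed_facts):
    """Turn every sentence that mentions a whole seed fact into a regex template.

    String templates are a simplification: ReNoun itself learns dependency
    patterns, not surface strings.
    """
    patterns = set()
    for sentence in sentences:
        for subject, attribute, obj in seed_facts:
            if not all(part in sentence for part in (subject, attribute, obj)):
                continue
            template = re.escape(sentence)
            # Replace each part of the known fact with a named capture group.
            for name, part in (("S", subject), ("A", attribute), ("O", obj)):
                template = template.replace(re.escape(part), f"(?P<{name}>.+?)", 1)
            patterns.add(template)
    return patterns


def extract_facts(sentences, patterns):
    """Apply learned templates and collect (subject, attribute, object) facts."""
    facts = set()
    for sentence in sentences:
        for pattern in patterns:
            match = re.fullmatch(pattern, sentence)
            if match:
                facts.add((match.group("S"), match.group("A"), match.group("O")))
    return facts


corpus = [
    "The CEO of Google, Larry Page, spoke.",
    "The president of Acme, Jane Doe, spoke.",
]
seeds = {("Google", "CEO", "Larry Page")}  # stage 1: known seed facts

patterns = learn_patterns(corpus, seeds)   # stage 2: pattern learning
facts = extract_facts(corpus, patterns)    # stage 3: fact extraction
print(sorted(facts))
```

The seed fact yields one template, which then also extracts the second, previously unknown fact from the corpus.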

Running the ReNoun System

Setup

This example assumes that the corpus is stored as text files in a folder ./files/ relative to the location of the baleen.jar file.

This example requires Mongo to be running on localhost:27017.

The scoring requires the GloVe word vector model, which can be downloaded from https://nlp.stanford.edu/projects/glove/ . For example, download glove.6B.zip and unzip it to the folder ./models/ relative to baleen.jar.
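
For orientation, the GloVe text files hold one token per line followed by its space-separated vector components. The sketch below parses that layout and compares two words by cosine similarity. It is not Baleen's scoring code; the function names and the tiny three-dimensional sample are made up for illustration.

```python
import io
import math


def load_vectors(lines):
    """Parse GloVe text format: a token followed by space-separated floats."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Made-up 3-dimensional sample; glove.6B.300d.txt has 300 values per line.
sample = io.StringIO("ceo 1.0 0.0 0.0\nchairman 1.0 1.0 0.0\nsister 0.0 0.0 1.0\n")
vectors = load_vectors(sample)

print(cosine(vectors["ceo"], vectors["chairman"]))
print(cosine(vectors["ceo"], vectors["sister"]))
```

With the real model, the same function can be given an open file handle, for example `open("./models/glove.6B.300d.txt", encoding="utf-8")`.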

If required, an 'attributes' file may be provided to limit the attributes to a specific set of nouns. An example is given below, but it should be tuned to the corpus, or omitted altogether to match all nouns.

attributes.txt

CEO
COO
CFO
Secretary
chief executive officer
chief operating officer
chief financial officer
Chief administrative officer
Chief analytics officer
Chief brand officer
Chief business development officer
Chief business officer
Chief commercial officer
Chief communications officer
Chief compliance officer
Chief creative officer
Chief customer officer
Chief data officer
Chief design officer
Chief digital officer
Chief diversity officer
Chief content officer
Chief events officer
Chief executive officer
Chief experience officer
Chief financial officer
Chief gaming officer
Chief genealogical officer
Chief human resources officer
Chief information officer
Chief information officer (higher education)
Chief information security officer
Chief innovation officer
Chief investment officer
Chief knowledge officer
Chief learning officer
Chief legal officer
Chief marketing officer
Chief operating officer
Chief privacy officer
Chief process officer
Chief product officer
Chief reputation officer
Chief research officer
Chief restructuring officer
Chief Revenue Officer
Chief risk officer
Chief science officer
Chief Scientific Officer
Chief security officer
Chief services officer
Chief strategy officer
Chief sustainability officer
Chief technology officer
Chief visibility officer
Chief visionary officer
Chief web officer
director
chairman
chairperson
president
owner
treasurer
board member
father
mother
brother
sister
wife
husband
partner
captain
chief
producer
coach
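
Attribute lists assembled from several sources often contain stray punctuation, footnote markers or entries that differ only in case. The following is a minimal sketch of a clean-up step that could be run before handing the file to Baleen. The function name is hypothetical, and whether Baleen matches attributes case-sensitively is not stated here, so the sketch keeps the original casing and only drops case-insensitive duplicates.

```python
import re


def clean_attributes(lines):
    """Strip whitespace, trailing commas and footnote markers such as [1],
    then drop blank lines and case-insensitive duplicates (first form wins)."""
    seen = set()
    cleaned = []
    for line in lines:
        attribute = re.sub(r"\[\d+\]", "", line).strip().rstrip(",").strip()
        key = attribute.lower()
        if attribute and key not in seen:
            seen.add(key)
            cleaned.append(attribute)
    return cleaned


raw = [
    "CEO",
    "chief executive officer",
    "Chief executive officer",
    "Chief business officer, ",
    "Chief events officer[1]",
    "",
]
print(clean_attributes(raw))
```

To clean a real file, the lines of attributes.txt can be read with `open("attributes.txt", encoding="utf-8")` and the result written back one attribute per line.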

Seed generation

The seed generation step can be used if the dependency parser or its model is changed from the default MaltParser. The seed generation pipeline should be configured in 0_seed_generation.yml.

0_seed_generation.yml

mongo:
  db: baleen-renoun
  host: localhost
 
# Supply the default document of fact sentences
collectionreader:
 class: renoun.ReNounSeedDocument

 
annotators:
# Ensure the language parsing is done in the pipeline
- language.OpenNLP
- language.MaltParser
     
# ReNoun Seed Fact Extraction
- class: renoun.ReNounSeedGenerator
  outputCollection: seedPatterns
 
# Save relations to Mongo
consumers:
- class: MongoRelations
  collection: seeds

This pipeline is run using:

java -jar baleen.jar -p 0_seed_generation.yml

Seed extraction

This pipeline extracts seed facts using a set of handcrafted patterns for the given attributes. There is also an option to use all nouns that match the patterns as attributes, if a target attribute list is not known. The seed facts are stored as relations in Mongo. These should be sanity checked and verified, removing any that are not valid, before moving on to the next stage.

The seed extraction pipeline should be configured in 1_generated_seed_extraction.yml if the seed generation step was run.

1_generated_seed_extraction.yml

mongo:
  db: baleen-renoun
  host: localhost
 
# Read your corpus here
collectionreader:
  class: FolderReader
  folders:
  - ./files/


 
annotators:
# Ensure the language parsing is done in the pipeline
- language.OpenNLP
- language.MaltParser

# Perform your usual entity extraction here e.g.
# ...
# ...

# ReNoun Seed Fact Extraction
- class: renoun.ReNounGeneratedSeedsRelationshipAnnotator
  collection: seedPatterns
#  attributesFile: attributes.txt 

 
# Save relations to Mongo
consumers:
- class: MongoRelations
  collection: seedFacts

This pipeline is run using:

java -jar baleen.jar -p 1_generated_seed_extraction.yml

If the seed generation step was skipped, the default seeds can be used instead via the 1_default_seed_extraction.yml pipeline file.

1_default_seed_extraction.yml

mongo:
  db: baleen-renoun
  host: localhost
 
# Read your corpus here
collectionreader:
  class: FolderReader
  folders:
  - ./files/


 
annotators:
# Ensure the language parsing is done in the pipeline
- language.OpenNLP
- language.MaltParser

# Perform your usual entity extraction here e.g.
# ...
# ...

# ReNoun Seed Fact Extraction
- class: renoun.ReNounDefaultSeedsRelationshipAnnotator
  collection: seedPatterns
#  attributesFile: attributes.txt 

 
# Save relations to Mongo
consumers:
- class: MongoRelations
  collection: seedFacts

This pipeline is run using:

java -jar baleen.jar -p 1_default_seed_extraction.yml

Pattern learning

The attribute list (if supplied) and the (refined) seed facts are used by this pipeline to generate more patterns that would have extracted these facts. These patterns are stored in Mongo.

Pattern learning can be configured in 2_pattern_learning.yml.

2_pattern_learning.yml

mongo:
  db: baleen-renoun
  host: localhost
 
# Read your corpus here
collectionreader:
  class: FolderReader
  folders:
  - ./files/

annotators:
- language.OpenNLP
- language.MaltParser
 
# ReNoun Pattern Learning
- class: renoun.ReNounPatternDataGenerator
  collection: seedFacts
# outputCollection: custom 

This pipeline is run using:

java -jar baleen.jar -p 2_pattern_learning.yml

Fact Extraction

Using the extended set of patterns, more facts/relations are extracted from the corpus to give the noun-based relations. These are stored as relations in Mongo and (optionally) in a specific collection for scoring.

This pipeline is configured in 3_fact_extraction.yml.

3_fact_extraction.yml

mongo:
  db: baleen-renoun
  host: localhost
 
# Read your corpus here
collectionreader:
  class: FolderReader
  folders:
  - ./files/
 
annotators:
# Ensure the language parsing is done in the pipeline (done in default here)
- language.OpenNLP
- language.MaltParser

# Perform your usual entity extraction here e.g.
# ...
# ...

# ReNoun Fact Extraction
- class: renoun.ReNounRelationshipAnnotator
  factCollection: renoun_facts
#  attributeFile: ./renoun/attributes 

 
# Save relations to Mongo
consumers:
- class: Mongo
  outputHistory: true
- class: MongoRelations

This pipeline is run using:

java -jar baleen.jar -p 3_fact_extraction.yml

Fact Scoring

This optional post-processing step scores the facts, giving more information about the confidence you should have in each extracted fact. It is configured in 4_fact_scoring.yml.

4_fact_scoring.yml

mongo:
  db: baleen-renoun
  host: localhost
  
tasks:
- class: renoun.ReNounScoring
  factCollection: renoun_facts
  model: ./models/glove.6B.300d.txt
  

This can be run as a Baleen job using:

java -jar baleen.jar -j 4_fact_scoring.yml
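
As a recap, the commands for the default-seed route can be generated from a single list. The sketch below only builds and prints the command lines; it does not launch Baleen. The helper name is hypothetical, and the stages are intentionally not chained automatically, because the seed facts should be reviewed by hand before pattern learning and because it is an open assumption whether each invocation exits on its own once its corpus has been processed.

```python
# (flag, configuration file) for each stage: -p runs a pipeline, -j runs a job.
STAGES = [
    ("-p", "1_default_seed_extraction.yml"),
    ("-p", "2_pattern_learning.yml"),
    ("-p", "3_fact_extraction.yml"),
    ("-j", "4_fact_scoring.yml"),
]


def build_commands(jar="baleen.jar", stages=STAGES):
    """Return one argument list per stage, in the order the stages are run."""
    return [["java", "-jar", jar, flag, config] for flag, config in stages]


for command in build_commands():
    print(" ".join(command))
```

If seed generation is used, the first entry would be replaced by 0_seed_generation.yml followed by 1_generated_seed_extraction.yml, both run with -p.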