Using Relationship Extraction
As of Baleen v2.2, support for some relationship extraction has been added through a new set of annotators. This guide takes you through configuring and using the following relationship annotators:
- relations.SimpleInteraction
- relations.UbmreConstituent
- relations.UbmreDependency
For full details of these annotators, and other relationship annotators not covered here, you should refer to the Javadoc. This guide is intended to get you started, but will not cover the intricacies of each annotator.
The annotators covered by this guide require some training prior to use. This should be performed on a representative training set of your data, one that can be expected to include examples of all the relationships you would like to extract.
The first stage is to identify "patterns" within your training set. A pattern is the text between two annotations (in this case entities), processed to be more meaningful than the raw covered text between them.
To do this, we need to first perform entity extraction on your data (ideally using the same annotators as you intend to use on the actual data). We then run a special pattern annotator and consumer to save these patterns to a Mongo database.
collectionreader:
  # Read in your training data here

annotators:
  # Perform your usual entity extraction here
  ...
  # Pattern Extraction
  - patterns.PatternExtractor

consumers:
  # Save patterns to Mongo
  - MongoPatternSaver
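To make the notion of a pattern more concrete, here is a minimal Python sketch of the idea: take the text covered between two entities and normalise it. This is not Baleen's implementation. The `Entity` class, the `pattern_between` function, the stop-word list and the choice of normalisation (lower-casing, stripping punctuation, dropping stop words) are all assumptions made for illustration; the Javadoc describes what `patterns.PatternExtractor` actually does.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    begin: int  # character offset where the entity starts
    end: int    # character offset just past the entity's last character
    kind: str   # entity type, e.g. "Person"


# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"a", "an", "the", "of", "to"}


def pattern_between(text, first, second):
    """Return the normalised words between two non-overlapping entities."""
    left, right = sorted((first, second), key=lambda e: e.begin)
    covered = text[left.end:right.begin]  # raw covered text between the entities
    words = [w.strip(".,;:!?\"'()").lower() for w in covered.split()]
    return [w for w in words if w and w not in STOP_WORDS]


text = "John saw the car."
john = Entity(0, 4, "Person")
car = Entity(13, 16, "Vehicle")
print(pattern_between(text, john, car))  # ['saw']
```

Here the raw covered text is " saw the ", while the processed pattern keeps only the word that carries the relationship.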
Now that we have patterns extracted from our training set, we need to convert them to "interactions". An Interaction is a word that acts as a relationship in a sentence, for instance "saw" in the sentence "John saw the car".
This is done using a Baleen job, which reads in our Patterns from Mongo and converts them into a CSV file. This CSV should be manually checked after it has been created to remove any unwanted or spurious interactions.
tasks:
  - class: interactions.IdentifyInteractions
    filename: output/interactions.csv
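Part of the manual check can be scripted once you know which words you do not want. The sketch below is a hypothetical helper, not part of Baleen. It makes no assumption about the column layout of the CSV (which is not described here) and simply drops any row in which a cell exactly matches an unwanted word. The function names and the reviewed output path are assumptions.

```python
import csv


def drop_unwanted(rows, unwanted):
    """Keep only rows in which no cell matches an unwanted word (case-insensitive)."""
    blocked = {w.lower() for w in unwanted}
    return [
        row for row in rows
        if not any(cell.strip().lower() in blocked for cell in row)
    ]


def review_csv(src, dst, unwanted):
    """Copy src to dst without the unwanted rows; return how many rows were dropped."""
    with open(src, newline="", encoding="utf-8") as f:
        rows = list(csv.reader(f))
    kept = drop_unwanted(rows, unwanted)
    with open(dst, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(kept)
    return len(rows) - len(kept)


# Example call (the output path and the word list are hypothetical):
# review_csv("output/interactions.csv", "output/interactions-reviewed.csv", {"said"})
```

If you write the result to a new file like this, remember to point the next job at the reviewed file. A script of this kind only removes known bad entries; it does not replace reading through the CSV yourself.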
Following the identification (and manual checking) of interactions, we can optionally run a job to enhance these interactions. This includes complementing the extracted interactions with synonyms. Again, the output CSV file should be manually checked following this stage to remove any unwanted or spurious enhancements.
tasks:
  - class: interactions.EnhanceInteractions
    input: output/interactions.csv
    output: output/interactions-enhanced.csv
Finally, to complete the training, we need to upload our CSV of enhanced interactions back to Mongo. This is done through a Baleen job.
tasks:
  - class: interactions.UploadInteractionsToMongo
    input: output/interactions-enhanced.csv
Once we have completed the training stages, we are able to perform relationship extraction on our full data set. Potentially, the same trained data could be used for a variety of different data sets, but it is recommended that training is performed for each different data set to achieve optimum performance.
Relationship extraction is done with a few specific annotators, which would usually come at the end of the pipeline (after entity cleaning and coreference have been performed). The purpose of these annotators is to:
- Extract interactions in the document
- Clean up extracted interactions
- Perform relationship extraction based on the extracted interactions
collectionreader:
  # Read in your training data here

annotators:
  # Perform your usual entity extraction, cleaning and coreference here
  ...
  # Interaction Extraction
  - class: gazetteer.MongoStemming
    collection: interactions
    type: Interaction
  # Clean Interactions
  - interactions.RemoveInteractionInEntities
  - interactions.AssignTypeToInteraction
  # Extract Relationships
  - relations.UbmreDependency # UbmreDependency is used here, but you could also use SimpleInteraction or UbmreConstituent

consumers:
  # Persist to your data store
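As a rough mental model of the three purposes listed above, the following toy Python sketch finds interaction words by stem, discards those that fall inside an entity, and then relates neighbouring entities that have an interaction between them. It is an illustration under stated assumptions only and does not reflect how the Baleen annotators work internally, in particular the Ubmre annotators, which are documented in the Javadoc. The interaction table, the crude suffix-stripping stemmer and all function names are hypothetical.

```python
import re

# Hypothetical trained interactions (stem -> relationship type), illustration only.
INTERACTIONS = {"own": "possession", "visit": "travel"}


def crude_stem(word):
    """Very rough suffix stripping; a stand-in for a real stemmer."""
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def find_interactions(text):
    """Step 1: mark every word whose stem is a known interaction."""
    found = []
    for match in re.finditer(r"\w+", text):
        kind = INTERACTIONS.get(crude_stem(match.group()))
        if kind is not None:
            found.append((match.start(), match.end(), kind))
    return found


def remove_inside_entities(interactions, entities):
    """Step 2: drop interactions that lie within an entity's span."""
    return [
        (b, e, kind)
        for (b, e, kind) in interactions
        if not any(eb <= b and e <= ee for (eb, ee) in entities)
    ]


def extract_relations(text, entities, interactions):
    """Step 3: relate neighbouring entities that have an interaction between them."""
    ordered = sorted(entities)
    relations = []
    for (sb, se), (tb, te) in zip(ordered, ordered[1:]):
        for b, e, kind in interactions:
            if se <= b and e <= tb:
                relations.append((text[sb:se], kind, text[tb:te]))
    return relations


text = "John owns the car."
entities = [(0, 4), (14, 17)]  # character spans of "John" and "car"
interactions = remove_inside_entities(find_interactions(text), entities)
print(extract_relations(text, entities, interactions))
# [('John', 'possession', 'car')]
```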