JohnDaws edited this page Aug 9, 2018 · 1 revision

Baleen 2.6 introduces co-reference resolution across documents, in particular the ability to link entities mentioned in the text to real-world entities. This extends the existing SieveCoreference annotators, which cluster mentions that refer to the same thing but do not relate that cluster to a real-world entity.

Linking to real-world entities requires the following three components (interfaces):

  • Information Collector - Collects (an opinionated set of) the information that is relevant to the entity linking task. For example, this may collect all the mentions of the entities and the sentences they occur in.
  • Candidate Supplier - Supplies a list of candidate matches from the datastore the entities are to be linked to. This is likely to require a bespoke implementation for each deployment, as it must connect to the specific datastore and use the correct fields to look up candidates.
  • Candidate Ranker - Ranks the candidates supplied. Different implementations can work in different ways using the information collected about each entity and information supplied from the datastore.
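
The way the three components fit together can be sketched as follows. This is a minimal Python illustration, not Baleen's Java API: the function names `collect`, `supply`, `rank` and `link`, the record layout and the toy data are all hypothetical.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical stand-ins for the three components; not Baleen's actual interfaces.
Collector = Callable[[str], Dict[str, List[str]]]   # text -> {mention: context sentences}
Supplier = Callable[[str], List[Dict[str, str]]]    # mention -> candidate records
Ranker = Callable[[List[str], List[Dict[str, str]]], Optional[Dict[str, str]]]


def link(text: str, collect: Collector, supply: Supplier, rank: Ranker) -> Dict[str, Optional[str]]:
    """Return {mention: id of the best candidate, or None when nothing matched}."""
    links: Dict[str, Optional[str]] = {}
    for mention, sentences in collect(text).items():
        candidates = supply(mention)
        best = rank(sentences, candidates) if candidates else None
        links[mention] = best["id"] if best else None
    return links


# Toy implementations, for illustration only.
def collect(text: str) -> Dict[str, List[str]]:
    return {"Ada Lovelace": [text]} if "Ada Lovelace" in text else {}


def supply(mention: str) -> List[Dict[str, str]]:
    store = [{"id": "person/1", "name": "Ada Lovelace"}]
    return [record for record in store if record["name"] == mention]


def rank(sentences: List[str], candidates: List[Dict[str, str]]) -> Optional[Dict[str, str]]:
    return candidates[0]  # trivial ranker: take the first candidate


result = link("Ada Lovelace wrote the first program.", collect, supply, rank)
print(result)
```

Keeping the three stages behind separate interfaces means a deployment can swap in its own supplier for its datastore while reusing the collector and ranker.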

Information Collector

Two Information Collectors are implemented:

  • JCasInformationCollector retrieves Entities from the JCas to be processed in the subsequent stages and only depends on prior entity extraction and coreference of the given entity type.
  • ProperNounInformationCollector uses part-of-speech tagging to restrict the search to proper nouns. This is the default because it performs better, but it requires part-of-speech tagging to be in the pipeline (for example, using the language.OpenNLP annotator).
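
The idea of restricting the search to proper nouns can be illustrated with a short Python sketch. It assumes the tagger emits Penn Treebank style tags (`NNP`, `NNPS`); the function name and the sample sentence are hypothetical, and Baleen's own collector works on JCas annotations rather than on token lists like this.

```python
from typing import List, Tuple

# Penn Treebank tags for singular and plural proper nouns (assumed tag set).
PROPER_NOUN_TAGS = {"NNP", "NNPS"}


def proper_noun_mentions(tagged: List[Tuple[str, str]]) -> List[str]:
    """Join runs of consecutive proper-noun tokens into candidate mentions."""
    mentions: List[str] = []
    current: List[str] = []
    for token, tag in tagged:
        if tag in PROPER_NOUN_TAGS:
            current.append(token)
        elif current:
            # A non-proper-noun token ends the current run.
            mentions.append(" ".join(current))
            current = []
    if current:
        mentions.append(" ".join(current))
    return mentions


tagged_sentence = [
    ("Alan", "NNP"), ("Turing", "NNP"), ("visited", "VBD"),
    ("the", "DT"), ("city", "NN"), ("of", "IN"), ("Manchester", "NNP"),
]
mentions = proper_noun_mentions(tagged_sentence)
print(mentions)
```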

Candidate Supplier

Two types of Candidate Supplier are implemented:

  • MongoCandidateSupplier uses Mongo to store the entities, so anyone without an existing datastore can create one to support entity linking. It also supports the case where a datastore exists but the user is not able to implement a candidate supplier for it: they only need to ingest the data into Mongo, and many tools support such a data migration without any coding.
  • The DBpedia suppliers use DBpedia as the data source and are provided as DBPediaPersonCandidateSupplier, DBPediaLocationCandidateSupplier and DBPediaOrganisationCandidateSupplier.
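
As a rough picture of what a candidate lookup does, the Python sketch below searches an in-memory list of documents on a configurable field. It is not the MongoCandidateSupplier implementation: the documents, the `find_candidates` helper and the case-insensitive exact match are assumptions made for illustration, and a real supplier would query the datastore instead of a list.

```python
from typing import Any, Dict, List

# Hypothetical documents, shaped as they might look after ingestion into a collection.
PEOPLE: List[Dict[str, Any]] = [
    {"_id": "p1", "name": "John Smith", "occupation": "engineer"},
    {"_id": "p2", "name": "John Smith", "occupation": "painter"},
    {"_id": "p3", "name": "Jane Doe", "occupation": "pilot"},
]


def find_candidates(collection: List[Dict[str, Any]], search_field: str,
                    mention: str, id_field: str = "_id") -> List[Dict[str, Any]]:
    """Return every document whose search field equals the mention, ignoring case."""
    wanted = mention.casefold()
    return [
        {"id": doc[id_field],
         "fields": {key: value for key, value in doc.items() if key != id_field}}
        for doc in collection
        if str(doc.get(search_field, "")).casefold() == wanted
    ]


found = find_candidates(PEOPLE, "name", "john smith")
print([candidate["id"] for candidate in found])
```

Two records share the name here, which is why a separate ranking step is needed to choose between them.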

Candidate Ranker

There are many ways that the candidates may be ranked, and information provided by a bespoke datastore may be able to offer better rankings than a general solution.
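
One simple way to rank candidates is by word overlap between the text around a mention and a description of each candidate. The Python sketch below illustrates that general bag-of-words idea only; it is not a description of how any Baleen ranker is implemented, and the function names, tokenisation and sample data are hypothetical.

```python
import re
from typing import Dict, List, Set, Tuple


def bag_of_words(text: str) -> Set[str]:
    """Lower-case the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def rank_candidates(context: str, descriptions: Dict[str, str]) -> List[Tuple[str, int]]:
    """Score each candidate by the number of words it shares with the context, best first."""
    context_words = bag_of_words(context)
    scored = [
        (candidate_id, len(context_words & bag_of_words(description)))
        for candidate_id, description in descriptions.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


context = "John Smith designed the bridge as chief engineer of the railway."
descriptions = {
    "p1": "John Smith, civil engineer who designed railway bridges",
    "p2": "John Smith, landscape painter",
}
ranking = rank_candidates(context, descriptions)
print(ranking)
```

A bespoke datastore could replace the overlap count with a score built from its own fields, such as known locations or affiliations.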

Examples

To run an example pipeline that includes coreferencing using DBpedia, add the following to your pipeline (which must already include entity extraction and part-of-speech tagging):

- class: coreference.EntityLinkingAnnotator
  entityType: Person
# informationCollector: ProperNounInformationCollector
  candidateSupplier: dbpedia.DBPediaPersonCandidateSupplier
# candidateRanker: BagOfWordsCandidateRanker

This uses the DBPediaPersonCandidateSupplier to search for any people indexed on DBpedia, based on the Entities provided by the ProperNounInformationCollector. If any Candidates are found, a linking value is added to the persisted entity.

To run an example pipeline that includes coreferencing using a pre-defined Mongo database, add the following to your pipeline:

- class: coreference.EntityLinkingAnnotator
  entityType: Person
# informationCollector: ProperNounInformationCollector
  candidateSupplier: mongo.MongoCandidateSupplier
# candidateRanker: BagOfWordsCandidateRanker
  candidateSupplierArguments: [
    "databaseName", "known_entities",
  # "port", "27017", # Note port is a String
    "searchField", "name",
    "collection", "people",
  # "idField", "_id"
  ]

Note that candidateSupplierArguments must be supplied for the MongoCandidateSupplier, even though the EntityLinkingAnnotator class does not mark it as mandatory. The mandatory values in this array are “databaseName”, “collection” and “searchField”.
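
Because the arguments are given as a flat list of alternating keys and values, a missing mandatory key is easy to overlook. The Python sketch below shows one way such a list could be checked. It is a hypothetical helper, not Baleen code, and it assumes that the values on the commented-out lines of the example ("27017" for port, "_id" for idField) are the defaults.

```python
from typing import Dict, List

MANDATORY = ("databaseName", "collection", "searchField")
# Assumed defaults, taken from the commented-out lines of the example configuration.
DEFAULTS = {"port": "27017", "idField": "_id"}


def parse_supplier_arguments(arguments: List[str]) -> Dict[str, str]:
    """Turn a flat [key, value, key, value, ...] list into a dict and check mandatory keys."""
    if len(arguments) % 2 != 0:
        raise ValueError("arguments must be given as key/value pairs")
    # Even positions are keys, odd positions are the matching values.
    settings = {**DEFAULTS, **dict(zip(arguments[0::2], arguments[1::2]))}
    missing = [key for key in MANDATORY if key not in settings]
    if missing:
        raise ValueError("missing mandatory arguments: " + ", ".join(missing))
    return settings


settings = parse_supplier_arguments(
    ["databaseName", "known_entities", "searchField", "name", "collection", "people"]
)
print(settings)
```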