James Baker edited this page May 8, 2017 · 6 revisions

Below is a sample pipeline configuration that includes some of the most common annotators. The pipeline reads from a folder called C:\baleen\data and writes its output to a Mongo database and an Elasticsearch index. The configuration for these two persistence stores is included below, although it is optional here because only the default parameters are used. It is assumed that the relevant OpenNLP models have been downloaded and placed in the models directory.

This pipeline should be considered a sample only. To achieve the best results, pipelines should be tailored to individual corpora, either by manually creating a pipeline configuration or by using the Plankton tool that is built into Baleen.

This pipeline is written for Baleen 2.4. Be aware that Baleen 2.4 orders the pipeline automatically, so the order in which the annotators are listed below is not the order in which they will run.

mongo:
  db: baleen
  host: localhost

elasticsearch:
  cluster: elasticsearch
  host: localhost

collectionreader:
  class: FolderReader
  folders:
  - C:\baleen\data

annotators:
- cleaners.AddGenderToPerson
- cleaners.AddTitleToPerson
- cleaners.CleanPunctuation
- cleaners.CleanTemporal
- cleaners.CollapseLocations
- cleaners.CorefBrackets
- cleaners.CorefCapitalisationAndApostrophe
- cleaners.CurrencyDetection
- cleaners.EntityInitials
- cleaners.ExpandLocationToDescription
- cleaners.MergeAdjacent
- cleaners.MergeAdjacentQuantities
- cleaners.MergeNationalityIntoEntity
- cleaners.NaiveMergeRelations
- cleaners.NormalizeOSGB
- cleaners.NormalizeTemporal
- cleaners.NormalizeWhitespace
- cleaners.ReferentToEntity
- cleaners.RelationTypeFilter
- cleaners.RemoveLowConfidenceEntities
- cleaners.RemoveNestedEntities
- cleaners.RemoveNestedLocations
- cleaners.RemoveOverlappingEntities
- cleaners.SplitBrackets
- cleaners.Surname
- coreference.SieveCoreference
- gazetteer.Country
- gazetteer.File
- class: gazetteer.Mongo
  type: Buzzword
  collection: buzzwords
- class: gazetteer.Mongo
  type: Location
  collection: location
- class: gazetteer.Mongo
  type: Organisation
  collection: organisations
- class: gazetteer.Mongo
  type: Person
  collection: people
- grammatical.NPAtCoordinate
- grammatical.NPElement
- grammatical.NPLocation
- grammatical.NPOrganisation
- grammatical.NPTitleEntity
- grammatical.QuantityNPEntity
- grammatical.TOLocationEntity
- language.OpenNLP
- class: misc.DocumentTypeByLocation
  baseDirectory: C:\baleen\data
- misc.GenericMilitaryPlatform
- misc.GenericVehicle
- misc.GenericWeapon
- misc.MentionedAgain
- misc.NationalityToLocation
- misc.OrganisationPersonRole
- misc.People
- misc.Pronouns
- regex.Area
- regex.BritishArmyUnits
- regex.Callsign
- regex.CasRegistryNumber
- regex.Date
- regex.DateTime
- regex.Distance
- regex.DocumentNumber
- regex.Dtg
- regex.Email
- regex.FlightNumber
- regex.Frequency
- regex.Hms
- regex.IpV4
- regex.LatLon
- regex.Mgrs
- regex.Money
- regex.Nationality
- regex.Osgb
- regex.Postcode
- regex.RelativeDate
- regex.SocialMediaUsername
- regex.TaskForce
- regex.Telephone
- regex.Time
- regex.TimeQuantity
- regex.USTelephone
- regex.UnqualifiedDate
- regex.Url
- regex.Volume
- regex.Weight
- class: relations.NPVNP
  onlyExisting: true
- stats.DocumentLanguage
- class: stats.OpenNLP
  model: models/en-ner-location.bin
  type: Location
- class: stats.OpenNLP
  model: models/en-ner-organization.bin
  type: Organisation
- class: stats.OpenNLP
  model: models/en-ner-person.bin
  type: Person

consumers:
- Mongo
- Elasticsearch
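
As an illustration of tailoring the sample to a corpus, the sketch below trims it down to a pipeline that only looks for people and contact details. It is a hypothetical selection, not a recommended configuration: every annotator name is taken from the sample above, but which ones suit a given corpus is an assumption that should be tested against real documents. The mongo block is omitted because, as noted above, it is optional when the default parameters are used. language.OpenNLP is kept on the assumption that the stats.OpenNLP annotator relies on its output.

```yaml
# Hypothetical trimmed pipeline: the annotator choice is illustrative only.
collectionreader:
  class: FolderReader
  folders:
  - C:\baleen\data

annotators:
# Kept on the assumption that stats.OpenNLP needs its sentence and token output.
- language.OpenNLP
# Pattern-based extraction of contact details.
- regex.Email
- regex.Telephone
- regex.Url
# Statistical person extraction; the model file must exist in the models directory.
- class: stats.OpenNLP
  model: models/en-ner-person.bin
  type: Person
# Tidy up entities that overlap one another.
- cleaners.RemoveOverlappingEntities

consumers:
# Uses the default Mongo parameters because no mongo block is given.
- Mongo
```

Starting from a small pipeline like this and adding annotators one at a time makes it easier to see which of them actually improve results on a particular corpus.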

For a full list of all the annotators, collection readers and consumers available, see the Wiki documentation, the included Javadoc, or the REST API.