JohnDaws edited this page Aug 13, 2018 · 1 revision

Baleen 2.6 contains a number of tools to extend Baleen's functionality for document triage.

The Mallet project has been integrated for document classification and topic model generation.

Functionality for document summarisation has been added with the triage.WordDistributionDocumentSummary annotator.

Functionality for document prioritisation is introduced with the triage.ShannonEntropyAnnotator, which uses Shannon entropy as a measure of information content in order to prioritise documents that convey more information in fewer words.
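As a rough illustration of the idea, the Python sketch below computes the Shannon entropy of a document's word distribution. It is not Baleen's implementation, and the annotator's exact computation may differ: the function name `shannon_entropy` and the naive lowercase, whitespace-based tokenisation are assumptions made for this example.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Entropy, in bits per word, of the word distribution in text."""
    # Assumption: naive lowercase whitespace tokenisation, for illustration only.
    words = text.lower().split()
    if not words:
        return 0.0
    counts = Counter(words)
    total = len(words)
    # H = sum(p * log2(1 / p)) over the observed word frequencies.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


repetitive = "spam spam spam spam"
varied = "the quick brown fox"
print(shannon_entropy(repetitive), shannon_entropy(varied))
```

Under this measure, a text that repeats a single word scores 0 bits per word, while a text of four distinct words scores 2 bits per word, so the more varied text would be ranked higher.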

Triage Examples

Model training

Before the categorisation annotators can be used, a model must be created. Three new Baleen jobs have been added for this purpose:

  • triage.TopicModelTrainer for when there are no prior labels.
  • triage.MaxEntClassifierTrainer for when the classes are provided by the user and identified by a list of keywords that suggest the class.
  • triage.MalletClassifierTrainer for when there is existing classification data to use as training data.

The trainer tasks read their corpus from Mongo. The collection and content field are configurable; by default, the tasks use the standard Mongo consumer’s ‘documents’ collection and ‘content’ field. A corpus can therefore be produced simply by running a Baleen pipeline with an appropriate collection reader and the Mongo consumer; no other annotators are required.

Topic Model Training

The topic model can be generated using the following Baleen job, which is configured in topictraining.yml:

java -jar baleen.jar -j topictraining.yml

topictraining.yml

mongo:
  db: baleen
  host: localhost

tasks:
- class: triage.TopicModelTrainer
  modelFile: ./models/topic.mallet
  numTopics: 10
# numIterations: 1000
# numThreads: 2
# collection: documents
# field: content

Maximum Entropy Training

The maximum entropy model is trained from user-defined classes, each described by a set of keywords. For example, "positive" and "negative" classes could be trained using the maxent.yml job, based on the text file labels.txt, in which each line contains a label followed by a number of keywords defining it.

maxent.yml

mongo:
  db: baleen
  host: localhost

tasks:
- class: triage.MaxEntClassifierTrainer
  labelsFile: labels.txt
  modelFile: ./models/maxent.mallet
# numIterations: 1000
# variance: 1.0
# collection: documents
# field: content

labels.txt

positive good love amazing best awesome
negative not can't enemy horrible ain't

Clearly this is a very brief labels file and so will not produce a robust model.
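Before training, it can be useful to check that a labels file reads the way you expect. The Python sketch below parses the format shown above (a label followed by whitespace-separated keywords on each line). It is a stand-alone illustration and not part of Baleen; the function name `parse_labels` is hypothetical, and skipping blank lines is an assumption about how such a file might be handled.

```python
from typing import Dict, List


def parse_labels(text: str) -> Dict[str, List[str]]:
    """Map each label (first token on a line) to its keywords (remaining tokens)."""
    labels: Dict[str, List[str]] = {}
    for line in text.splitlines():
        tokens = line.split()
        if not tokens:
            continue  # assumption: blank lines are ignored
        label, keywords = tokens[0], tokens[1:]
        labels[label] = keywords
    return labels


sample = """positive good love amazing best awesome
negative not can't enemy horrible ain't"""
for label, keywords in parse_labels(sample).items():
    print(label, len(keywords), keywords)
```

Printing the keyword count per label makes it easy to spot classes that are defined by too few keywords.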

The model can be trained as a Baleen job:

java -jar baleen.jar -j maxent.yml

Labelled Data

If you have labelled data, further Mallet classifiers can be trained on it. The trainer allows multiple classifiers to be trained in the same job and can output an assessment of accuracy based on randomly partitioning the data for training and testing, so that the best performing model can be taken forward. We have added an example job to train multiple classifiers, but labelled data must be loaded into Mongo to use it. The collection and label field within Mongo are configurable, but default to 'documents' and 'labels' respectively.
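The random partitioning described above can be sketched in a few lines of Python. This is an illustration of hold-out evaluation in general, not of how Mallet or Baleen implement it: the helper names `split_for_testing` and `accuracy`, the fixed seed, and the toy documents are all hypothetical. The 0.2 fraction mirrors the forTesting value used in the example job, on the assumption that it denotes the proportion of data held back for testing.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_for_testing(
    items: Sequence[T], for_testing: float, seed: int = 0
) -> Tuple[List[T], List[T]]:
    """Randomly partition items into (training, testing) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_test = int(round(len(shuffled) * for_testing))
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(predicted: Sequence[str], actual: Sequence[str]) -> float:
    """Fraction of predictions that match the true labels."""
    if not actual:
        return 0.0
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


# Toy labelled corpus: (document id, label) pairs.
docs = [(f"doc{i}", "positive" if i % 2 else "negative") for i in range(10)]
train, test = split_for_testing(docs, for_testing=0.2)
print(len(train), len(test))
```

With ten documents and a test fraction of 0.2, two documents are held back for testing and eight are used for training. Each candidate classifier would be trained on the training portion and scored with `accuracy` on the held-back portion.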

The following job can be used to train the models on this data:

java -jar baleen.jar -j classify.yml

The job is configured by the classify.yml file:

classify.yml

mongo:
  db: baleen
  host: localhost

tasks:
- class: triage.MalletClassifierTrainer
  trainer:
  - RandomAssignmentTrainer
  - NaiveBayes
  - DecisionTree,maxDepth=10
  - DecisionTree,maxDepth=20
  - DecisionTree,maxDepth=40
  - BalancedWinnow
  - MaxEnt
  forTesting: 0.2
  resultFile: ./models/classifyTrials.csv
  modelFile: ./models/classify
#  collection: documents
#  labelField: label

Note that this trainer produces a number of Mallet models, each with a file name prefixed with "classify". The example pipeline below assumes that one of them has been selected and renamed "classify.mallet".

Running the Triage Annotators

Given suitably trained models, the full set of triage annotators can be run using the following Baleen pipeline:

mongo:
  db: baleen
  host: localhost
  
collectionreader:
- class: FolderReader
  folders:
  - ./files/
  
annotators:
- language.OpenNLP
- class: triage.CommonKeywords
  stemming: ENGLISH
- class: triage.RakeKeywords
  stemming: ENGLISH
- class: triage.ShannonEntropyAnnotator
- class: triage.WordDistributionDocumentSummary
#  summaryCharacterCount: 100
- class: triage.TopicModel
  modelFile: ./models/topic.mallet
- class: triage.MalletClassifier
  modelFile: ./models/classify.mallet
- class: triage.MalletClassifier
  modelFile: ./models/maxent.mallet

  
consumers:
- Mongo