Release of version 2.2.0

The following is a summary of the new features and changes in Baleen 2.2.0. There may be additional changes and features. Please refer to the diff and commit logs for full details.

New core features
* All entities now have a sub-type
* Added gender to Person
* Baleen Jobs framework
* Plankton visual pipeline tool

New collection readers and improvements to existing collection readers
* EmailReader
* FolderReader now accepts a regular expression to filter against, rather than a file extension
* MucReader
* ReutersReader

New annotators and improvements to existing annotators
* Added nautical miles to Distance regex
* CorefBrackets cleaner (replaces CorefLocationCoordinate cleaner)
* Coreference annotators and sieves
* Improvements to LatLon annotator
* Interaction annotators
* Keyword extraction annotators (RakeKeywords and CommonKeywords)
* Relationship annotators
  * NPVNP
  * SimpleInteraction
  * UbmreConstituent
  * UbmreDependency
* Rewrite of MoneyRegex to fix issues with previous version
* USTelephone

New consumers and improvements to existing consumers
* CSV Consumers
* Elasticsearch upgraded to Elasticsearch 2
* ElasticsearchRest
* MongoPatternSaver
* Print consumers to output information to the console

New jobs
* Interactions jobs
* MongoStats

New resources
* SharedStopwordResource
* SharedWordNetResource

Bug fixes, improved unit testing, updated dependencies and reductions to technical debt

Please be aware that some aspects of this release may not be backwards compatible with previous versions.
dstl · Jun 1, 2016 · af9abb2
1 parent 19e1689
commit af9abb2
Showing 462 changed files with 29,913 additions and 4,616 deletions.
diff --git a/BUILD.md b/BUILD.md
@@ -11,4 +11,4 @@
 3. Right click on `baleen` project, select Run As... -> 3. Maven Build...
 4. Type `package` into the Goals box, and then click Run
 5. The Baleen JAR will be built and saved in the target directory under the top level project directory
-6. Run Baleen by running `java -jar baleen-2.0.0.jar` and then navigating to <http://localhost:6413>
+6. Run Baleen by running `java -jar baleen-2.2.0.jar` and then navigating to <http://localhost:6413>
diff --git a/README.md b/README.md
@@ -1,7 +1,5 @@
 # Baleen
 
-[![Join the chat at https://gitter.im/dstl/baleen](https://badges.gitter.im/dstl/baleen.svg)](https://gitter.im/dstl/baleen?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge)
-
 Baleen is an extensible text processing capability that allows entity-related information to be extracted from unstructured and semi-structured data sources. It makes available in a structured format things of interest otherwise stored in formats such as text documents - references to people, organisations, unique identifiers, location information.
 
 Baleen is written in [Java 8](http://www.oracle.com/java/javase/downloads/jre8-downloads-2133155.html) using the software project management tool [Maven 3](http://maven.apache.org) and draws heavily on the [Apache Unstructured Information Management Architecture (UIMA)](http://uima.apache.org) which provides a framework, components and infrastructure to handle unstructured information management.    
@@ -16,9 +14,9 @@ Baleen includes an in-built server, which hosts full documentation and guides on
 To get started, you will need to launch this server and read this documentation.
 To launch the server, run the following command.
 
-> java -jar baleen-2.1.0.jar
+> java -jar baleen-2.2.0.jar
 
-Once running, the server can be accessed at [http://localhost:6413](http://localhost:6413) 
+Once running, the server can be accessed at [http://localhost:6413](http://localhost:6413).
 
 If you require the Javadoc to be available through the in-built server, then you should place the Baleen Javadoc JAR in the same directory as the Baleen JAR.
 
@@ -84,4 +82,4 @@ Licensed under the ODC Public Domain Dedication and Licence (PDDL) 1.0 - [http:/
 
 ## OpenNLP Language Models
 
-Licensed under the Apache Software License 2.0 - [http://www.apache.org/licenses/LICENSE-2.0](http://www.apache.org/licenses/LICENSE-2.0)
+Licensed under the Apache Software License 2.0 - [http://www.apache.org/licenses/LICENSE-2.0](http://www.apache.org/licenses/LICENSE-2.0)
diff --git a/THIRD-PARTY.txt b/THIRD-PARTY.txt
diff --git a/baleen/baleen-annotators/pom.xml b/baleen/baleen-annotators/pom.xml
@@ -4,7 +4,7 @@
 	<parent>
 		<groupId>uk.gov.dstl.baleen</groupId>
 		<artifactId>baleen</artifactId>
-		<version>2.2.0-SNAPSHOT</version>
+		<version>2.2.0</version>
 	</parent>
 	<artifactId>baleen-annotators</artifactId>
 	<name>Baleen Annotators</name>
@@ -35,7 +35,11 @@
 			<artifactId>opennlp-tools</artifactId>
 			<version>${opennlp.version}</version>
 		</dependency>
-
+		<dependency>
+			<groupId>org.maltparser</groupId>
+			<artifactId>maltparser</artifactId>
+			<version>${maltparser.version}</version>
+		</dependency>
 		<dependency>
 			<groupId>org.apache.commons</groupId>
 			<artifactId>commons-lang3</artifactId>

diff --git a/...een-annotators/src/main/java/uk/gov/dstl/baleen/annotators/cleaners/AddTitleToPerson.java b/...een-annotators/src/main/java/uk/gov/dstl/baleen/annotators/cleaners/AddTitleToPerson.java
@@ -0,0 +1,73 @@
+package uk.gov.dstl.baleen.annotators.cleaners;
+
+import java.util.Collection;
+
+import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
+import org.apache.uima.fit.util.JCasUtil;
+import org.apache.uima.jcas.JCas;
+
+import uk.gov.dstl.baleen.annotators.grammatical.NPTitleEntity;
+import uk.gov.dstl.baleen.types.common.Person;
+import uk.gov.dstl.baleen.uima.BaleenAnnotator;
+
+/**
+ * Add title (mr, president, etc) information to previously found people.
+ * <p>
+ * Often with NLP models we find a person, e.g. John Smith but omit the title information, e.g.
+ * General John Smith, General Sir John Smith. This annotator adds that information back onto the
+ * entity, thus improving the quality of person extraction and reducing the number of unannotated
+ * words in a document.
+ *
+ * @baleen.javadoc
+ */
+public class AddTitleToPerson extends BaleenAnnotator {
+
+	@Override
+	protected void doProcess(JCas jCas) throws AnalysisEngineProcessException {
+		// Copy the selection into a new list, as we modify the people's offsets as we go
+		Collection<Person> people = new java.util.ArrayList<>(JCasUtil.select(jCas, Person.class));
+
+		for (Person p : people) {
+			while(makeReplacement(jCas, p)){
+				//Make as many replacements as possible, to capture things like Sir Major General Smith. 
+			}
+		}
+	}
+
+	private boolean makeReplacement(JCas jCas, Person p){
+		boolean replacementMade = false;
+
+		for(String title : NPTitleEntity.TITLES){
+			if(p.getBegin() - title.length() - 1 < 0)
+				continue;
+
+			String precedingText = jCas.getDocumentText().substring(p.getBegin() - title.length() - 1, p.getBegin() - 1);
+			if(title.equalsIgnoreCase(precedingText)){
+				p.setBegin(p.getBegin() - title.length() - 1);
+				p.setTitle(extendTitle(precedingText, p.getTitle()));
+
+				replacementMade = true;
+			}
+		}
+
+		return replacementMade;
+	}
+
+	/**
+	 * Add the prefix to the existing title.
+	 *
+	 * @param prefix
+	 *            the prefix
+	 * @param title
+	 *            the title
+	 * @return the string
+	 */
+	private String extendTitle(String prefix, String title) {
+		if (title == null || title.isEmpty()) {
+			return prefix;
+		} else {
+			return prefix + " " + title;
+		}
+	}
+
+}
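
The title-extension loop in `AddTitleToPerson` can be illustrated without UIMA. The following is a minimal sketch on plain strings, not the Baleen implementation: the class name, the method `extendBegin` and the short `TITLES` list are hypothetical stand-ins (the real list is `NPTitleEntity.TITLES`), and the sketch additionally checks that the separating character is a space, which the annotator above does not do.

```java
import java.util.Arrays;
import java.util.List;

/** Plain-string sketch of the title-extension loop; no UIMA types are involved. */
final class TitleExtensionSketch {

    // Hypothetical title list for illustration only.
    static final List<String> TITLES = Arrays.asList("mr", "sir", "general", "major");

    /**
     * Returns the begin offset after absorbing every title that directly precedes
     * the span starting at begin, separated from it by exactly one space.
     */
    static int extendBegin(String text, int begin) {
        boolean changed = true;
        while (changed) { // repeat until a full pass over the titles changes nothing
            changed = false;
            for (String title : TITLES) {
                int start = begin - title.length() - 1;
                if (start < 0) {
                    continue; // not enough text before the span for this title
                }
                // Assumption added in this sketch: the separator must be a space.
                boolean spaceBefore = text.charAt(begin - 1) == ' ';
                // Case-insensitive comparison of the preceding text with the title.
                if (spaceBefore && text.regionMatches(true, start, title, 0, title.length())) {
                    begin = start;
                    changed = true;
                }
            }
        }
        return begin;
    }

    public static void main(String[] args) {
        String text = "We met General Sir John Smith today.";
        int begin = text.indexOf("John");
        int end = begin + "John Smith".length();
        int extended = extendBegin(text, begin);
        System.out.println(text.substring(extended, end)); // prints "General Sir John Smith"
    }
}
```

Looping until no title matches is what allows stacked titles such as "General Sir" to be absorbed one at a time.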
diff --git a/...-annotators/src/main/java/uk/gov/dstl/baleen/annotators/cleaners/NaiveMergeRelations.java b/...-annotators/src/main/java/uk/gov/dstl/baleen/annotators/cleaners/NaiveMergeRelations.java
@@ -0,0 +1,133 @@
+package uk.gov.dstl.baleen.annotators.cleaners;
+
+import java.util.ArrayList;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Set;
+import java.util.stream.Collectors;
+
+import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
+import org.apache.uima.fit.descriptor.ConfigurationParameter;
+import org.apache.uima.fit.util.JCasUtil;
+import org.apache.uima.jcas.JCas;
+
+import uk.gov.dstl.baleen.types.semantic.Entity;
+import uk.gov.dstl.baleen.types.semantic.Relation;
+import uk.gov.dstl.baleen.uima.BaleenAnnotator;
+
+/**
+ * Removes multiple copies of the same relation within a document.
+ *
+ * This is a naive and simple approach which can hide many issues - it is effectively performing
+ * relationship coreference and deduplication based solely at a relationship level. The algorithm
+ * works by checking whether the relationship types are the same, and whether the entities are the
+ * same (this is also difficult; it is based on entities having the same type and value, which may
+ * be incorrect for multiple John Smiths).
+ *
+ * This is only really useful if you want to ensure that from a single document you get only a single
+ * relationship of the same type and subtype between the same two entities, because you want to
+ * (naively) push data into a database and not have to consider this in future algorithms (focusing on
+ * counting the same relations appearing in different documents).
+ *
+ */
+public class NaiveMergeRelations extends BaleenAnnotator {
+
+	/**
+	 * Symmetric relations (x ~ y and y ~ x are considered the same) if true
+	 *
+	 * @baleen.config true
+	 */
+	public static final String KEY_SYMMETRIC = "symmetric";
+	@ConfigurationParameter(name = KEY_SYMMETRIC, defaultValue = "true")
+	private Boolean symmetric;
+
+	@Override
+	protected void doProcess(final JCas jCas) throws AnalysisEngineProcessException {
+		final List<Relation> relations = new ArrayList<>(JCasUtil.select(jCas, Relation.class));
+
+		final Set<Relation> toRemove = new HashSet<>();
+
+		for (int i = 0; i < relations.size(); i++) {
+			final Relation a = relations.get(i);
+
+			if (!toRemove.contains(a)) {
+				toRemove.addAll(findSameRelations(a, relations.subList(i + 1, relations.size())));
+			}
+		}
+
+		removeFromJCasIndex(toRemove);
+	}
+
+	/**
+	 * Finds any relations from the list <em>relations</em> that are the same as <em>a</em>
+	 */
+	private List<Relation> findSameRelations(Relation a, List<Relation> relations){
+		return relations.stream().filter(b -> isSame(a, b)).collect(Collectors.toList());
+	}
+
+	/**
+	 * Checks if relations are the same.
+	 *
+	 * @param a
+	 *            the first relation
+	 * @param b
+	 *            the second relation
+	 * @return true, if is same
+	 */
+	private boolean isSame(final Relation a, final Relation b) {
+		boolean sameSourceTarget = false;
+		if(isSame(a.getSource(), b.getSource()) && isSame(a.getTarget(), b.getTarget())){
+			sameSourceTarget = true;
+		}else if(symmetric && isSame(a.getSource(), b.getTarget()) && isSame(a.getTarget(), b.getSource())){
+			//Symmetric, so source and target could be switched
+			sameSourceTarget = true;
+		}
+
+		return sameSourceTarget
+			&& isSame(a.getRelationshipType(), b.getRelationshipType())
+			&& isSame(a.getRelationSubType(), b.getRelationSubType());
+	}
+
+	/**
+	 * Checks if entity is the same
+	 *
+	 * @param a
+	 *            the first entity
+	 * @param b
+	 *            the second entity
+	 * @return true, if is same
+	 */
+	private boolean isSame(final Entity a, final Entity b) {
+		if (a == null && b == null) {
+			return true;
+		}
+
+		if (a == null || b == null) {
+			// exactly one of a and b is null here, so they cannot be the same
+			return false;
+		}
+
+		// TODO: is the value test enough?
+		return a.getType().equals(b.getType()) && isSame(a.getValue(), b.getValue());
+	}
+
+	/**
+	 * Checks if two strings are the same.
+	 *
+	 * @param a
+	 *            first string
+	 * @param b
+	 *            second string
+	 * @return true, if is same
+	 */
+	private boolean isSame(final String a, final String b) {
+		if (a == null && b == null) {
+			return true;
+		} else if (a == null || b == null) {
+			return false;
+		} else {
+			return a.equalsIgnoreCase(b);
+		}
+	}
+
+}
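
The effect of the `symmetric` parameter in `NaiveMergeRelations` can be shown with a small standalone sketch. This is an illustration under stated assumptions rather than the annotator itself: `Rel` is a hypothetical stand-in for a Baleen `Relation` that holds only entity values and a type (the real comparison also checks entity types and the relation sub-type), and the loop is written as "keep the first occurrence" instead of collecting a removal set.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Standalone sketch of naive relation deduplication; no UIMA types are involved. */
final class RelationDedupSketch {

    /** Hypothetical stand-in for a relation: source value, target value and type. */
    static final class Rel {
        final String source;
        final String target;
        final String type;

        Rel(String source, String target, String type) {
            this.source = source;
            this.target = target;
            this.type = type;
        }
    }

    /** Null-safe, case-insensitive string comparison. */
    static boolean same(String a, String b) {
        return a == null ? b == null : a.equalsIgnoreCase(b);
    }

    /** Two relations match if their types match and their ends match, optionally swapped. */
    static boolean same(Rel a, Rel b, boolean symmetric) {
        boolean direct = same(a.source, b.source) && same(a.target, b.target);
        boolean swapped = symmetric && same(a.source, b.target) && same(a.target, b.source);
        return (direct || swapped) && same(a.type, b.type);
    }

    /** Keeps the first occurrence of each relation and drops later duplicates. */
    static List<Rel> dedupe(List<Rel> relations, boolean symmetric) {
        List<Rel> kept = new ArrayList<>();
        for (Rel candidate : relations) {
            boolean duplicate = false;
            for (Rel k : kept) {
                if (same(k, candidate, symmetric)) {
                    duplicate = true;
                    break;
                }
            }
            if (!duplicate) {
                kept.add(candidate);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Rel> relations = Arrays.asList(
                new Rel("Alice", "Bob", "met"),
                new Rel("bob", "alice", "MET"), // same pair with the ends swapped
                new Rel("Alice", "Carol", "met"));
        System.out.println(dedupe(relations, true).size());  // prints 2
        System.out.println(dedupe(relations, false).size()); // prints 3
    }
}
```

With `symmetric` set to true the swapped pair is treated as a duplicate; with false it is kept as a distinct relation.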