Developing Collection Readers
This documentation refers to Baleen 2.7 and later. For a worked example for Baleen 2.6 or earlier, see Developing Collection Readers (pre Baleen 2.7).
Collection readers are the first component in a Baleen pipeline, and responsible for finding data to process. Generally, they will then pass this data to a content extractor to convert it into plain text, before passing this text onwards to the Annotator stage of the pipeline.
In this guide, we will develop a collection reader that reads files in a directory. For the purposes of this guide, we will not worry about changes to the folder or recursion; for examples of how this would work, see the source code for FolderReader.
As we are developing a new collection reader, we need a dependency on the baleen-collectionreaders module, which provides many of the base and utility classes that we will use, as well as access to other common dependencies. To do this, we add the following to our POM file, setting the version to match the Baleen release we are building against:
<dependency>
<groupId>uk.gov.dstl.baleen</groupId>
<artifactId>baleen-collectionreaders</artifactId>
<version>2.4.0</version>
</dependency>
To start with, let's create a new Java class called SimpleFolderReader which extends BaleenCollectionReader. The BaleenCollectionReader class is an abstract class that does a lot of the behind-the-scenes work required by Baleen, but leaves us free to implement the logic of the collection reader. We will create it in the uk.gov.dstl.baleen.collectionreaders.guides package to keep it separate from existing collection readers.
package uk.gov.dstl.baleen.collectionreaders.guides;

import java.io.IOException;

import org.apache.uima.UimaContext;
import org.apache.uima.collection.CollectionException;
import org.apache.uima.jcas.JCas;
import org.apache.uima.resource.ResourceInitializationException;

import uk.gov.dstl.baleen.uima.BaleenCollectionReader;

public class SimpleFolderReader extends BaleenCollectionReader {

    @Override
    protected void doInitialize(UimaContext context) throws ResourceInitializationException {
    }

    @Override
    protected void doGetNext(JCas jCas) throws IOException, CollectionException {
    }

    @Override
    protected void doClose() throws IOException {
    }

    @Override
    public boolean doHasNext() throws IOException, CollectionException {
        return false;
    }
}
There are four stub methods for us to populate, which we will do in the following sections.
The first thing we will want to do is initialise our collection reader using some user-provided configuration. In our case, the only configuration we are interested in is the folder in which to find the files; the content extractor is handled for us by BaleenCollectionReader. We will add the folder as a configuration parameter at the top of the class, along with a list to hold the files we find.
/**
 * The folder containing files
 *
 * @baleen.config Current directory
 */
public static final String PARAM_FOLDER = "folder";
@ConfigurationParameter(name = PARAM_FOLDER, defaultValue = ".")
private String folder;

List<File> files;
Now that we have the configuration provided by the user, along with a default in case it isn't provided, we can use it in the doInitialize() function to initialise the collection reader. To keep things simple, here we don't worry too much about error catching or invalid configuration parameters.
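@Override
protected void doInitialize(UimaContext context) throws ResourceInitializationException {
    // Get a list of files in the folder
    File f = new File(folder);
    // Copy into an ArrayList so that entries can be removed later;
    // Arrays.asList on its own returns a fixed-size list
    files = new ArrayList<>(Arrays.asList(f.listFiles(new FileFilter() {
        @Override
        public boolean accept(File pathname) {
            return pathname.isFile();
        }
    })));
}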
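Because doGetNext() later removes entries from files, the list built during initialisation has to be mutable. Arrays.asList returns a fixed-size list backed by the array, and calling remove on it throws UnsupportedOperationException. The following is a minimal JDK-only sketch of the listing step outside of Baleen; the names FolderListingSketch and listRegularFiles are illustrative and not part of the Baleen API, and the null check is an extra precaution that the guide's code leaves out for simplicity.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class FolderListingSketch {

    /** Returns a mutable list of the regular files directly inside dir (no recursion). */
    static List<File> listRegularFiles(File dir) {
        // listFiles returns null if dir does not exist, is not a directory, or cannot be read
        File[] found = dir.listFiles(File::isFile);
        if (found == null) {
            return new ArrayList<>();
        }
        // Copy the fixed-size Arrays.asList view into an ArrayList so remove(0) is supported
        return new ArrayList<>(Arrays.asList(found));
    }

    public static void main(String[] args) {
        List<File> files = listRegularFiles(new File("."));
        // Consume the list from the front, as the reader does one file at a time
        while (!files.isEmpty()) {
            System.out.println(files.remove(0).getName());
        }
    }
}
```

Running main prints the names of the regular files in the current directory, skipping subdirectories.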
In this example, we do not need to do anything in the doClose() method, but we cannot omit it as it is an abstract method of the BaleenCollectionReader class.
@Override
protected void doClose() throws IOException {
    // Nothing to clean up in this example
}
We now have an initialised collection reader with a list of files to be processed. The processing itself is done by populating two methods: doHasNext() and doGetNext(). The function doHasNext() is regularly polled by the BaleenCollectionReader to see whether there are new files to process. If it returns true, then doGetNext() is called. If it returns false, then there is a short delay (1 second by default) before it is called again to see if there are now files available for processing.
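The polling contract described above can be illustrated with a small JDK-only simulation. This is a sketch under stated assumptions, not Baleen's actual implementation: the Source interface, fromQueue, and drain are hypothetical names, and the loop is bounded by maxPolls so that the example terminates.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Queue;

class PollingSketch {

    /** Hypothetical stand-in for the two methods a reader implements. */
    interface Source {
        boolean hasNext();
        String getNext();
    }

    /** Adapts a queue of file names to the Source contract. */
    static Source fromQueue(Queue<String> queue) {
        return new Source() {
            @Override
            public boolean hasNext() {
                return !queue.isEmpty();
            }

            @Override
            public String getNext() {
                return queue.remove();
            }
        };
    }

    /** Polls up to maxPolls times, sleeping delayMillis after each poll that finds nothing. */
    static List<String> drain(Source source, int maxPolls, long delayMillis) {
        List<String> processed = new ArrayList<>();
        for (int poll = 0; poll < maxPolls; poll++) {
            if (source.hasNext()) {
                // Something is available, so fetch it straight away
                processed.add(source.getNext());
            } else {
                try {
                    // Nothing available: wait before asking again
                    Thread.sleep(delayMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    break;
                }
            }
        }
        return processed;
    }

    public static void main(String[] args) {
        Queue<String> queue = new ArrayDeque<>(Arrays.asList("a.txt", "b.txt"));
        // Two polls consume the files; the remaining two find nothing and wait 1 second each
        System.out.println(drain(fromQueue(queue), 4, 1000L)); // prints [a.txt, b.txt]
    }
}
```

The point of the sketch is that hasNext must be cheap and side-effect free, since it may be called many times while nothing is available.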
In our case, whether we have more files to process can be determined simply by checking whether the files list is empty.
@Override
public boolean doHasNext() throws IOException, CollectionException {
    return !files.isEmpty();
}
In the doGetNext() function, we need to pull a file from the files list and pass it through to the content extractor. Content extractors accept an InputStream of the file to process and a String containing source information. The JCas object supplied by Baleen is also passed along for the content extractor to populate.
@Override
protected void doGetNext(JCas jCas) throws IOException, CollectionException {
    // Check that we have a file to process
    if (files.isEmpty()) {
        getMonitor().error("No documents on the queue - this method should not have been called");
        throw new CollectionException();
    }

    // Remove the file from the list
    File f = files.remove(0);

    // Pass the file to the Content Extractor
    try (InputStream is = new FileInputStream(f)) {
        extractContent(is, f.getAbsolutePath(), jCas);
    }
}
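To make the hand-off to the content extractor more concrete, here is a JDK-only sketch of what a receiver of those three arguments might do. Everything in it is hypothetical: Document stands in for the JCas, and extract stands in for a content extractor that simply reads the stream as UTF-8 text. It is not how Baleen's own extractors are implemented. It assumes Java 9 or later for InputStream.readAllBytes.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class ExtractionSketch {

    /** Hypothetical stand-in for the JCas: holds the extracted text and its source. */
    static class Document {
        String text;
        String source;
    }

    /** Hypothetical extractor: reads the whole stream as UTF-8 and populates the document. */
    static void extract(InputStream in, String source, Document doc) {
        try {
            // readAllBytes requires Java 9 or later
            doc.text = new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        doc.source = source;
    }

    public static void main(String[] args) throws IOException {
        Document doc = new Document();
        byte[] bytes = "Hello Baleen".getBytes(StandardCharsets.UTF_8);
        // try-with-resources closes the stream whether or not extraction succeeds
        try (InputStream is = new ByteArrayInputStream(bytes)) {
            extract(is, "/tmp/example.txt", doc);
        }
        System.out.println(doc.source + ": " + doc.text);
    }
}
```

As in doGetNext(), the caller owns the stream and closes it with try-with-resources, while the extractor only reads from it.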
And that's it, we now have a collection reader that will find all of the files in a directory. The complete code for the example is available below.
package uk.gov.dstl.baleen.collectionreaders.guides;

import java.io.File;
import java.io.FileFilter;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.uima.UimaContext;
import org.apache.uima.collection.CollectionException;
import org.apache.uima.fit.descriptor.ConfigurationParameter;
import org.apache.uima.jcas.JCas;
import org.apache.uima.resource.ResourceInitializationException;

import uk.gov.dstl.baleen.uima.BaleenCollectionReader;

public class SimpleFolderReader extends BaleenCollectionReader {

    /**
     * The folder containing files
     *
     * @baleen.config Current directory
     */
    public static final String PARAM_FOLDER = "folder";
    @ConfigurationParameter(name = PARAM_FOLDER, defaultValue = ".")
    private String folder;

    List<File> files;

    @Override
    protected void doInitialize(UimaContext context) throws ResourceInitializationException {
        // Get a list of files in the folder
        File f = new File(folder);
        // Copy into an ArrayList so that entries can be removed later;
        // Arrays.asList on its own returns a fixed-size list
        files = new ArrayList<>(Arrays.asList(f.listFiles(new FileFilter() {
            @Override
            public boolean accept(File pathname) {
                return pathname.isFile();
            }
        })));
    }

    @Override
    protected void doGetNext(JCas jCas) throws IOException, CollectionException {
        // Check that we have a file to process
        if (files.isEmpty()) {
            getMonitor().error("No documents on the queue - this method should not have been called");
            throw new CollectionException();
        }

        // Remove the file from the list
        File f = files.remove(0);

        // Pass the file to the Content Extractor
        try (InputStream is = new FileInputStream(f)) {
            extractContent(is, f.getAbsolutePath(), jCas);
        }
    }

    @Override
    protected void doClose() throws IOException {
        // Nothing to clean up in this example
    }

    @Override
    public boolean doHasNext() throws IOException, CollectionException {
        return !files.isEmpty();
    }
}
Now that you've built your collection reader, including it in Baleen takes two steps: ensure that your class is on the classpath, and then reference it in your pipeline configuration, specifying the fully qualified class name (package and class name).
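As an illustration, the relevant part of a pipeline configuration might look something like the fragment below. This is a sketch that assumes the YAML pipeline layout used in Baleen's bundled example configurations; the ./input path is a placeholder, and the folder key corresponds to the PARAM_FOLDER parameter defined earlier.

```yaml
collectionreader:
  # Fully qualified name of the reader developed in this guide
  class: uk.gov.dstl.baleen.collectionreaders.guides.SimpleFolderReader
  # Placeholder path; defaults to the current directory if omitted
  folder: ./input
```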
This collection reader will use the default content extractor to ingest the data from each file. This can be configured within the pipeline if a different content extractor should be used.