Developing Collection Readers pre Baleen 2.7
As of Baleen 2.7.0, individual collection readers no longer need to handle content extractors. The example below is therefore not compatible with Baleen 2.7.0 or later, but is preserved here for the record. See Developing Collection Readers for up-to-date documentation.
Collection readers are the first component in a Baleen pipeline, and are responsible for finding data to process. Generally, they then pass this data to a content extractor to convert it into plain text, before passing this text onwards to the Annotator stage of the pipeline.
In this guide, we will be developing a collection reader to read files in a directory. For the purposes of this guide, we will not worry about changes to the folder or recursion; for examples of how this would work, see the source code for FolderReader.
As we are developing a new collection reader, we need to ensure we have a dependency on the baleen-collectionreaders module, as this will provide many of the base and utility classes that we will use as well as access to other common dependencies. To do this, we need to add the following to our POM file:
<dependency>
    <groupId>uk.gov.dstl.baleen</groupId>
    <artifactId>baleen-collectionreaders</artifactId>
    <version>2.4.0</version>
</dependency>
To start with, let's create a new Java class called SimpleFolderReader which extends BaleenCollectionReader. The BaleenCollectionReader class is an abstract class that does a lot of the behind-the-scenes work required by Baleen, but leaves us free to implement the logic of the collection reader. We will create it in the uk.gov.dstl.baleen.collectionreaders.guides package to keep it separate from existing collection readers.
package uk.gov.dstl.baleen.collectionreaders.guides;

import java.io.IOException;

import org.apache.uima.UimaContext;
import org.apache.uima.collection.CollectionException;
import org.apache.uima.jcas.JCas;
import org.apache.uima.resource.ResourceInitializationException;

import uk.gov.dstl.baleen.uima.BaleenCollectionReader;

public class SimpleFolderReader extends BaleenCollectionReader {

    @Override
    protected void doInitialize(UimaContext context) throws ResourceInitializationException {
    }

    @Override
    protected void doGetNext(JCas jCas) throws IOException, CollectionException {
    }

    @Override
    protected void doClose() throws IOException {
    }

    @Override
    public boolean doHasNext() throws IOException, CollectionException {
        return false;
    }
}
There are four stub methods for us to populate, which we will do in the following sections.
The first thing we will want to do is initialise our collection reader using some user-provided configuration. In our case, the configuration we are interested in is the content extractor to pass files to, and the folder in which to find the files. We will add both of these as configuration parameters at the top of the class.
/**
 * The folder containing files
 *
 * @baleen.config Current directory
 */
public static final String PARAM_FOLDER = "folder";
@ConfigurationParameter(name = PARAM_FOLDER, defaultValue = ".")
private String folder;

/**
 * The content extractor to use to extract content from files
 *
 * @baleen.config TikaContentExtractor
 */
public static final String PARAM_CONTENT_EXTRACTOR = "contentExtractor";
@ConfigurationParameter(name = PARAM_CONTENT_EXTRACTOR, defaultValue = "TikaContentExtractor")
private String contentExtractor = "TikaContentExtractor";

private IContentExtractor extractor;
List<File> files;
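The fallback behaviour of these parameters can be imitated in plain Java. The sketch below is only an illustration under stated assumptions: the real injection is performed by the @ConfigurationParameter annotation, whereas the class ConfigDefaultsSketch, the resolve helper, and the example path are hypothetical names introduced here.

```java
import java.util.HashMap;
import java.util.Map;

class ConfigDefaultsSketch {
    static final String PARAM_FOLDER = "folder";
    static final String PARAM_CONTENT_EXTRACTOR = "contentExtractor";

    /** Returns the user-supplied value for a parameter, or the default when none was given. */
    static String resolve(Map<String, String> userConfig, String name, String defaultValue) {
        String value = userConfig.get(name);
        return value != null ? value : defaultValue;
    }

    public static void main(String[] args) {
        Map<String, String> userConfig = new HashMap<>();
        userConfig.put(PARAM_FOLDER, "/data/incoming"); // hypothetical path

        // The folder comes from the user; the extractor falls back to its default.
        System.out.println(resolve(userConfig, PARAM_FOLDER, "."));
        System.out.println(resolve(userConfig, PARAM_CONTENT_EXTRACTOR, "TikaContentExtractor"));
    }
}
```

Keeping the parameter names in constants, as the guide does, means the configuration key and the code that reads it cannot drift apart.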
Now that we have the configuration parameters provided by the user, along with some defaults in case they aren't provided, we can use these in the doInitialize() function to initialise the collection reader. To keep things simple, here we don't worry too much about error catching or invalid configuration parameters.
@Override
protected void doInitialize(UimaContext context) throws ResourceInitializationException {
    // Initialise the content extractor using helper functions
    try {
        extractor = getContentExtractor(contentExtractor);
    } catch (InvalidParameterException ipe) {
        throw new ResourceInitializationException(ipe);
    }
    extractor.initialize(context, getConfigParameters(context));

    // Get a list of files in the folder. Arrays.asList() returns a fixed-size
    // list, so copy it into an ArrayList (java.util.ArrayList) to allow
    // doGetNext() to remove entries later.
    File f = new File(folder);
    files = new ArrayList<>(Arrays.asList(f.listFiles(new FileFilter() {
        @Override
        public boolean accept(File pathname) {
            return pathname.isFile();
        }
    })));
}
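The file-listing step can be tried on its own, outside Baleen. The following is a minimal sketch using only the Java standard library; the class FileQueueSketch and the method listRegularFiles are hypothetical names, not part of Baleen. It copies the result into an ArrayList because the list returned by Arrays.asList() is fixed-size and would throw UnsupportedOperationException on remove(), and it also guards against listFiles() returning null when the path is not a readable directory.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class FileQueueSketch {
    /** Returns a mutable list of the regular files directly inside the folder. */
    static List<File> listRegularFiles(File folder) {
        File[] found = folder.listFiles(File::isFile);
        if (found == null) {
            // Not a directory, or it could not be read: treat as empty
            return new ArrayList<>();
        }
        return new ArrayList<>(Arrays.asList(found));
    }

    public static void main(String[] args) throws IOException {
        // Build a throwaway directory containing one file and one sub-directory
        File dir = Files.createTempDirectory("reader-sketch").toFile();
        new File(dir, "a.txt").createNewFile();
        new File(dir, "sub").mkdir();

        List<File> files = listRegularFiles(dir);
        File first = files.remove(0); // works because the list is mutable
        System.out.println(first.getName() + ", remaining: " + files.size());
    }
}
```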
Because we created a content extractor during initialisation, we also need to destroy that instance properly during shutdown. This can be done in the doClose() function.
@Override
protected void doClose() throws IOException {
    if (extractor != null) {
        extractor.destroy();
        extractor = null;
    }
}
We now have an initialised collection reader with a list of files to be processed. The processing itself is driven by two methods that we need to populate: doHasNext() and doGetNext(). The function doHasNext() is regularly polled by the BaleenCollectionReader to see whether there are new files to process. If it returns true, then doGetNext() is called. If it returns false, then there is a short delay (1 second by default) before it is called again to see if there are now files available for processing.
In our case, whether we have more files to process can be determined simply by looking at whether the List files is empty or not.
@Override
public boolean doHasNext() throws IOException, CollectionException {
    return !files.isEmpty();
}
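The polling contract between doHasNext() and doGetNext() can be imitated with a small stand-alone loop. This is a sketch only, not Baleen's actual scheduling code: the class PollingLoopSketch, the method drain, and the maxIdlePolls cut-off are hypothetical, and the cut-off exists purely so that the example terminates, whereas a real reader keeps being polled.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Queue;

class PollingLoopSketch {
    /**
     * Processes queued items in order, sleeping between polls while the queue
     * is empty, and gives up after maxIdlePolls consecutive empty polls.
     */
    static List<String> drain(Queue<String> pending, int maxIdlePolls, long delayMillis)
            throws InterruptedException {
        List<String> processed = new ArrayList<>();
        int idlePolls = 0;
        while (idlePolls < maxIdlePolls) {
            if (!pending.isEmpty()) {             // plays the role of doHasNext()
                processed.add(pending.remove());  // plays the role of doGetNext()
                idlePolls = 0;
            } else {
                idlePolls++;
                Thread.sleep(delayMillis);        // short delay before polling again
            }
        }
        return processed;
    }

    public static void main(String[] args) throws InterruptedException {
        Queue<String> pending = new ArrayDeque<>(Arrays.asList("a.txt", "b.txt"));
        System.out.println(drain(pending, 2, 10L));
    }
}
```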
In the doGetNext() function, we need to pull a file from the files list and pass it through to the content extractor. Content extractors accept an InputStream of the file to process and a String containing source information. The JCas object supplied by Baleen is also passed in for the content extractor to populate.
@Override
protected void doGetNext(JCas jCas) throws IOException, CollectionException {
    // Check that we have a file to process
    if (files.isEmpty()) {
        getMonitor().error("No documents on the queue - this method should not have been called");
        throw new CollectionException();
    }

    // Remove the file from the list
    File f = files.remove(0);

    // Pass the file to the Content Extractor
    try (InputStream is = new FileInputStream(f)) {
        extractor.processStream(is, f.getAbsolutePath(), jCas);
    }
}
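The hand-off pattern used here (take the next file, open a stream with try-with-resources, pass the stream and its source path to a consumer) can be exercised without Baleen. In this sketch the StreamConsumer interface is a hypothetical stand-in for a content extractor, and a StringBuilder stands in for the JCas; neither reflects Baleen's real types.

```java
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.List;

class StreamHandOffSketch {
    /** Hypothetical stand-in for a content extractor. */
    interface StreamConsumer {
        void processStream(InputStream in, String source, StringBuilder target) throws IOException;
    }

    /** A trivial consumer that treats the stream as UTF-8 text. */
    static final StreamConsumer PLAIN_TEXT = (in, source, target) -> {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int read;
        while ((read = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, read);
        }
        target.append(new String(buffer.toByteArray(), StandardCharsets.UTF_8));
    };

    /** Removes the first file from the list and passes its contents to the consumer. */
    static String processNext(List<File> files, StreamConsumer consumer) throws IOException {
        if (files.isEmpty()) {
            throw new IllegalStateException("No files left to process");
        }
        File f = files.remove(0);
        StringBuilder target = new StringBuilder();
        // try-with-resources closes the stream even if the consumer throws
        try (InputStream is = new FileInputStream(f)) {
            consumer.processStream(is, f.getAbsolutePath(), target);
        }
        return target.toString();
    }
}
```

Opening the stream in the reader rather than in the consumer keeps the responsibility for closing it in one place.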
And that's it, we now have a collection reader that will find all of the files in a directory and pass them to a content extractor to parse. The complete code for the example is available below.
package uk.gov.dstl.baleen.collectionreaders.guides;

import java.io.File;
import java.io.FileFilter;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.uima.UimaContext;
import org.apache.uima.collection.CollectionException;
import org.apache.uima.fit.descriptor.ConfigurationParameter;
import org.apache.uima.jcas.JCas;
import org.apache.uima.resource.ResourceInitializationException;

import uk.gov.dstl.baleen.exceptions.InvalidParameterException;
import uk.gov.dstl.baleen.uima.BaleenCollectionReader;
import uk.gov.dstl.baleen.uima.IContentExtractor;

public class SimpleFolderReader extends BaleenCollectionReader {

    /**
     * The folder containing files
     *
     * @baleen.config Current directory
     */
    public static final String PARAM_FOLDER = "folder";
    @ConfigurationParameter(name = PARAM_FOLDER, defaultValue = ".")
    private String folder;

    /**
     * The content extractor to use to extract content from files
     *
     * @baleen.config TikaContentExtractor
     */
    public static final String PARAM_CONTENT_EXTRACTOR = "contentExtractor";
    @ConfigurationParameter(name = PARAM_CONTENT_EXTRACTOR, defaultValue = "TikaContentExtractor")
    private String contentExtractor = "TikaContentExtractor";

    private IContentExtractor extractor;
    List<File> files;

    @Override
    protected void doInitialize(UimaContext context) throws ResourceInitializationException {
        // Initialise the content extractor using helper functions
        try {
            extractor = getContentExtractor(contentExtractor);
        } catch (InvalidParameterException ipe) {
            throw new ResourceInitializationException(ipe);
        }
        extractor.initialize(context, getConfigParameters(context));

        // Get a list of files in the folder. Arrays.asList() returns a
        // fixed-size list, so copy it into an ArrayList to allow
        // doGetNext() to remove entries later.
        File f = new File(folder);
        files = new ArrayList<>(Arrays.asList(f.listFiles(new FileFilter() {
            @Override
            public boolean accept(File pathname) {
                return pathname.isFile();
            }
        })));
    }

    @Override
    protected void doGetNext(JCas jCas) throws IOException, CollectionException {
        // Check that we have a file to process
        if (files.isEmpty()) {
            getMonitor().error("No documents on the queue - this method should not have been called");
            throw new CollectionException();
        }

        // Remove the file from the list
        File f = files.remove(0);

        // Pass the file to the Content Extractor
        try (InputStream is = new FileInputStream(f)) {
            extractor.processStream(is, f.getAbsolutePath(), jCas);
        }
    }

    @Override
    protected void doClose() throws IOException {
        if (extractor != null) {
            extractor.destroy();
            extractor = null;
        }
    }

    @Override
    public boolean doHasNext() throws IOException, CollectionException {
        return !files.isEmpty();
    }
}
Now that you've built your collection reader, including it in Baleen takes two steps: ensure that your class is on the classpath, and reference it in your pipeline configuration by its full package and class name (here, uk.gov.dstl.baleen.collectionreaders.guides.SimpleFolderReader).