Developing Collection Readers pre Baleen 2.7

Note

Since Baleen 2.7.0, individual collection readers no longer need to handle content extractors. As such, the example below is not compatible with Baleen 2.7.0 or later, but is preserved here for the record. See Developing Collection Readers for up-to-date documentation.

Developing Collection Readers for Baleen 2.6 and earlier

Collection readers are the first component in a Baleen pipeline, and are responsible for finding data to process. Generally, they pass this data to a content extractor, which converts it into plain text, and the text is then passed onwards to the Annotator stage of the pipeline.

In this guide, we will be developing a collection reader to read files in a directory. For the purposes of this guide, we will not worry about changes to the folder or recursion into subfolders; for an example of how these could be handled, see the source code for FolderReader.

Configuring Dependencies

As we are developing a new collection reader, we need to ensure we have a dependency on the baleen-collectionreaders module, as this will provide many of the base and utility classes that we will use as well as access to other common dependencies. To do this, we need to add the following to our POM file:

<dependency>
    <groupId>uk.gov.dstl.baleen</groupId>
    <artifactId>baleen-collectionreaders</artifactId>
    <version>2.4.0</version>
</dependency>

Creating the Class

To start with, let's create a new Java class called SimpleFolderReader which extends BaleenCollectionReader. The BaleenCollectionReader class is an abstract class that does a lot of the behind-the-scenes work required by Baleen, but leaves us free to implement the logic of the collection reader. We will create it in the uk.gov.dstl.baleen.collectionreaders.guides package to keep it separate from existing collection readers.

package uk.gov.dstl.baleen.collectionreaders.guides;

import java.io.IOException;

import org.apache.uima.UimaContext;
import org.apache.uima.collection.CollectionException;
import org.apache.uima.jcas.JCas;
import org.apache.uima.resource.ResourceInitializationException;

import uk.gov.dstl.baleen.uima.BaleenCollectionReader;

public class SimpleFolderReader extends BaleenCollectionReader {
	@Override
	protected void doInitialize(UimaContext context) throws ResourceInitializationException {

	}

	@Override
	protected void doGetNext(JCas jCas) throws IOException, CollectionException {

	}

	@Override
	protected void doClose() throws IOException {

	}

	@Override
	public boolean doHasNext() throws IOException, CollectionException {
		return false;
	}
}

There are four stub methods for us to populate, which we will do in the following sections.

Initialisation and Clean Up

The first thing we will want to do is initialise our collection reader using some user-provided configuration. In our case, the configuration we are interested in is the content extractor to pass files to, and the folder in which to find the files. We will add both of these as configuration parameters at the top of the class.

/**
 * The folder containing files
 * 
 * @baleen.config Current directory
 */
public static final String PARAM_FOLDER = "folder";
@ConfigurationParameter(name = PARAM_FOLDER, defaultValue = ".")
private String folder;

/**
 * The content extractor to use to extract content from files
 * 
 * @baleen.config TikaContentExtractor
 */
public static final String PARAM_CONTENT_EXTRACTOR = "contentExtractor";
@ConfigurationParameter(name = PARAM_CONTENT_EXTRACTOR, defaultValue="TikaContentExtractor")
private String contentExtractor = "TikaContentExtractor";

private IContentExtractor extractor;
private List<File> files;

Now that we have the configuration parameters provided by the user, along with some defaults in case they aren't provided, we can use these in the doInitialize() function to initialise the collection reader. To keep things simple, here we don't worry too much about error catching or invalid configuration parameters.

@Override
protected void doInitialize(UimaContext context) throws ResourceInitializationException {
	//Initialise the content extractor using helper functions
	try{
		extractor = getContentExtractor(contentExtractor);
	}catch(InvalidParameterException ipe){
		throw new ResourceInitializationException(ipe);
	}
	extractor.initialize(context, getConfigParameters(context));

	//Get a list of files in the folder (listFiles returns null if the folder can't be read)
	File[] found = new File(folder).listFiles(new FileFilter() {
		@Override
		public boolean accept(File pathname) {
			return pathname.isFile();
		}
	});

	//Copy into an ArrayList, as the fixed-size list from Arrays.asList doesn't support remove()
	files = new ArrayList<>();
	if(found != null){
		files.addAll(Arrays.asList(found));
	}
}
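The file listing step can also be tried on its own, outside Baleen. The following is a minimal sketch using only the JDK; the class name FolderListingSketch and the method listRegularFiles are illustrative names and are not part of Baleen. It assumes that a missing or unreadable folder should simply produce an empty list.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FolderListingSketch {
    /** Returns a mutable list of the regular files directly inside the folder. */
    static List<File> listRegularFiles(File folder) {
        // listFiles returns null when the path is not a directory or cannot be read
        File[] found = folder.listFiles(File::isFile);
        if (found == null) {
            return new ArrayList<>();
        }
        // Copy into an ArrayList so that entries can be removed later
        return new ArrayList<>(Arrays.asList(found));
    }

    public static void main(String[] args) throws IOException {
        // Build a temporary folder containing one file and one subfolder
        File dir = Files.createTempDirectory("reader-sketch").toFile();
        new File(dir, "a.txt").createNewFile();
        new File(dir, "sub").mkdir();

        System.out.println(listRegularFiles(dir).size());                      // prints 1
        System.out.println(listRegularFiles(new File(dir, "missing")).size()); // prints 0
    }
}
```

Subfolders are filtered out rather than descended into, which matches the scope of this guide.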

Because we created a content extractor during initialisation, we also need to destroy that instance properly during shutdown. This can be done in the doClose() function.

@Override
protected void doClose() throws IOException {
	if(extractor != null) {
		extractor.destroy();
		extractor = null;
	}
}

Reading the File

We now have an initialised collection reader with a list of files to be processed. The processing itself is implemented by populating two methods: doHasNext() and doGetNext(). The function doHasNext() is regularly polled by the BaleenCollectionReader to see whether there are new files to process. If it returns true, then doGetNext() is called. If it returns false, then there is a short delay (1 second by default) before it is called again to see if there are now files available for processing.

In our case, whether we have more files to process can be determined simply by looking at whether the List files is empty or not.

@Override
public boolean doHasNext() throws IOException, CollectionException {
	return !files.isEmpty();
}
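The polling behaviour described above can be illustrated without any Baleen classes. The following is a rough sketch only: the Source interface, the poll method and its maxPolls and delayMillis parameters are hypothetical names, and the loop is an assumption about the general pattern rather than a copy of how BaleenCollectionReader is actually implemented.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Queue;

public class PollingSketch {
    /** Hypothetical stand-in for the two methods a collection reader populates. */
    interface Source {
        boolean hasNext();
        String getNext();
    }

    /** Polls the source up to maxPolls times, sleeping after each empty poll. */
    static List<String> poll(Source source, int maxPolls, long delayMillis) {
        List<String> processed = new ArrayList<>();
        for (int i = 0; i < maxPolls; i++) {
            if (source.hasNext()) {
                // Something is available, so fetch it straight away
                processed.add(source.getNext());
            } else {
                // Nothing available yet, so wait before asking again
                try {
                    Thread.sleep(delayMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    break;
                }
            }
        }
        return processed;
    }

    /** Wraps a fixed set of items as a Source, much as the files list does. */
    static Source of(String... items) {
        Queue<String> queue = new ArrayDeque<>(Arrays.asList(items));
        return new Source() {
            @Override public boolean hasNext() { return !queue.isEmpty(); }
            @Override public String getNext() { return queue.remove(); }
        };
    }

    public static void main(String[] args) {
        System.out.println(poll(of("a.txt", "b.txt"), 4, 10)); // prints [a.txt, b.txt]
    }
}
```

Because getNext is only called after hasNext has returned true, the reader never has to block waiting for data; it just reports whether anything is available.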

In the doGetNext() function, we need to pull a file from the files list and pass it through to the content extractor. Content extractors accept an InputStream of the file to process and a String containing source information. The JCas object supplied by Baleen is also passed in for the content extractor to populate.

@Override
protected void doGetNext(JCas jCas) throws IOException, CollectionException {
	//Check that we have a file to process
	if(files.isEmpty()){
		getMonitor().error("No documents on the queue - this method should not have been called");
		throw new CollectionException();
	}

	//Remove the file from the list
	File f = files.remove(0);
	
	//Pass the file to the Content Extractor
	try(
		InputStream is = new FileInputStream(f);
	){
		extractor.processStream(is, f.getAbsolutePath(), jCas);
	}
}
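Note that doGetNext() removes entries from files, so the list must support removal. A list returned directly by Arrays.asList is fixed-size and throws UnsupportedOperationException when remove is called, whereas a copy held in an ArrayList does not. A minimal standalone sketch of the difference (the class and method names are illustrative and not part of Baleen):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class MutableQueueSketch {
    /** Returns true if the first element could be removed from the list. */
    static boolean canRemoveFirst(List<String> list) {
        try {
            list.remove(0);
            return true;
        } catch (UnsupportedOperationException e) {
            // Fixed-size lists reject structural changes such as remove
            return false;
        }
    }

    public static void main(String[] args) {
        List<String> fixedSize = Arrays.asList("a.txt", "b.txt");
        List<String> mutable = new ArrayList<>(fixedSize);

        System.out.println(canRemoveFirst(fixedSize)); // prints false
        System.out.println(canRemoveFirst(mutable));   // prints true
    }
}
```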

Conclusion

And that's it: we now have a collection reader that will find all of the files in a directory and pass them to a content extractor to parse. The complete code for the example is available below.

package uk.gov.dstl.baleen.collectionreaders.guides;

import java.io.File;
import java.io.FileFilter;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.uima.UimaContext;
import org.apache.uima.collection.CollectionException;
import org.apache.uima.fit.descriptor.ConfigurationParameter;
import org.apache.uima.jcas.JCas;
import org.apache.uima.resource.ResourceInitializationException;

import uk.gov.dstl.baleen.exceptions.InvalidParameterException;
import uk.gov.dstl.baleen.uima.BaleenCollectionReader;
import uk.gov.dstl.baleen.uima.IContentExtractor;

public class SimpleFolderReader extends BaleenCollectionReader {
	/**
	 * The folder containing files
	 * 
	 * @baleen.config Current directory
	 */
	public static final String PARAM_FOLDER = "folder";
	@ConfigurationParameter(name = PARAM_FOLDER, defaultValue = ".")
	private String folder;

	/**
	 * The content extractor to use to extract content from files
	 * 
	 * @baleen.config TikaContentExtractor
	 */
	public static final String PARAM_CONTENT_EXTRACTOR = "contentExtractor";
	@ConfigurationParameter(name = PARAM_CONTENT_EXTRACTOR, defaultValue="TikaContentExtractor")
	private String contentExtractor = "TikaContentExtractor";

	private IContentExtractor extractor;
	private List<File> files;

	@Override
	protected void doInitialize(UimaContext context) throws ResourceInitializationException {
		//Initialise the content extractor using helper functions
		try{
			extractor = getContentExtractor(contentExtractor);
		}catch(InvalidParameterException ipe){
			throw new ResourceInitializationException(ipe);
		}
		extractor.initialize(context, getConfigParameters(context));

		//Get a list of files in the folder (listFiles returns null if the folder can't be read)
		File[] found = new File(folder).listFiles(new FileFilter() {
			@Override
			public boolean accept(File pathname) {
				return pathname.isFile();
			}
		});

		//Copy into an ArrayList, as the fixed-size list from Arrays.asList doesn't support remove()
		files = new ArrayList<>();
		if(found != null){
			files.addAll(Arrays.asList(found));
		}
	}

	@Override
	protected void doGetNext(JCas jCas) throws IOException, CollectionException {
		//Check that we have a file to process
		if(files.isEmpty()){
			getMonitor().error("No documents on the queue - this method should not have been called");
			throw new CollectionException();
		}

		//Remove the file from the list
		File f = files.remove(0);
		
		//Pass the file to the Content Extractor
		try(
			InputStream is = new FileInputStream(f);
		){
			extractor.processStream(is, f.getAbsolutePath(), jCas);
		}
	}

	@Override
	protected void doClose() throws IOException {
		if(extractor != null) {
			extractor.destroy();
			extractor = null;
		}
	}

	@Override
	public boolean doHasNext() throws IOException, CollectionException {
		return !files.isEmpty();
	}
}

Now that you have built your collection reader, including it in Baleen is straightforward: ensure that your class is on the classpath, then reference it in your pipeline configuration using the full package and class name (here, uk.gov.dstl.baleen.collectionreaders.guides.SimpleFolderReader).
