
Architecture Overview


Introduction

The CharaParser framework allows for different types of executions and is highly configurable, down to a very detailed processing level. To configure CharaParser for a specific execution, the dependency injection framework Guice is utilized.

On a high level, it can, for example, be executed to

  • mark up descriptions [0]
  • evaluate created markup [1]
  • or perform other tasks [2, 3, 4, ...].

On another level, it may, for example, be configured to mark up descriptions of type

  • morphology [5]
  • habitat [6]
  • elevation [7]
  • phenology [8]
  • ecology [9]
  • distribution [10].

The strategies used to mark up any of these types can also be varied, down to very fine-grained details, such as which measurement units are supported (cm, dm, m, etc.) [11].
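
To make this concrete, below is a minimal Guice sketch of how such a fine-grained parameter could be exposed as a binding. The module class, the binding name measurementUnits, and the main method are illustrative assumptions, not CharaParser's actual configuration code.

```java
import com.google.inject.AbstractModule;
import com.google.inject.Guice;
import com.google.inject.Injector;
import com.google.inject.Key;
import com.google.inject.TypeLiteral;
import com.google.inject.name.Names;
import java.util.Arrays;
import java.util.List;

// Hypothetical module: shows how a fine-grained parameter such as the supported
// measurement units could be made available for injection. The binding name
// "measurementUnits" is an assumption for illustration only.
public class UnitsExampleModule extends AbstractModule {
    @Override
    protected void configure() {
        bind(new TypeLiteral<List<String>>() {})
            .annotatedWith(Names.named("measurementUnits"))
            .toInstance(Arrays.asList("cm", "dm", "m"));
    }

    public static void main(String[] args) {
        Injector injector = Guice.createInjector(new UnitsExampleModule());
        List<String> units = injector.getInstance(
            Key.get(new TypeLiteral<List<String>>() {}, Names.named("measurementUnits")));
        System.out.println("Supported units: " + units); // Supported units: [cm, dm, m]
    }
}
```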

The JavaDoc of the project can be found here.

Entry-point, Configuration, Run

The following two diagrams illustrate the utilization of the injection framework. AbstractModule is a Guice class utilized to create the injection bindings. The hierarchy inheriting from AbstractModule merely layers the configuration options of CharaParser by detail level (i.e., BaseConfig [12] contains parameters that likely do not vary, while RunConfig [13] contains parameters that frequently change between runs) and finally by taxon group [14]. A command line interface (CLI) entry point [15] may parse [16] command line options and parameters to update the configuration for the run according to the user's inputs. Eventually an IRun is created [17] to execute the desired run configuration.

Run and Guice Configuration

ETCLearnMain
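
The layering idea can be sketched as follows, assuming heavily simplified stand-in types: a base module holds rarely changing bindings, a run module adds or overrides the frequently changing ones, and the entry point asks Guice for an IRun. Apart from the Guice classes themselves, every name below is a placeholder, not one of CharaParser's actual classes or signatures.

```java
import com.google.inject.AbstractModule;
import com.google.inject.Guice;
import com.google.inject.Injector;

// Simplified stand-in for the real IRun interface.
interface IRun {
    void run();
}

// Simplified stand-in for a concrete run such as a learn run.
class LearnRun implements IRun {
    @Override
    public void run() {
        System.out.println("learning terminology...");
    }
}

// Base layer: parameters that rarely vary would be bound here.
class BaseConfigSketch extends AbstractModule {
    @Override
    protected void configure() {
        // rarely changing bindings
    }
}

// Run layer: parameters adjusted per run, e.g. which IRun to execute.
class RunConfigSketch extends BaseConfigSketch {
    @Override
    protected void configure() {
        super.configure();
        bind(IRun.class).to(LearnRun.class);
    }
}

public class CliEntryPointSketch {
    public static void main(String[] args) {
        // a real CLI would parse args here and adjust the run configuration accordingly
        Injector injector = Guice.createInjector(new RunConfigSketch());
        IRun run = injector.getInstance(IRun.class);
        run.run();
    }
}
```

Because the run is resolved through the injector, the entry point can switch between runs or taxon groups by choosing or adjusting modules without touching the run implementations themselves.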

Learn

ETCLearnRun [18] is one example of an IRun [19] and is used for illustration in the following. Marking up morphological descriptions requires the terminology used in the descriptions to be learned first. This is done by ETCLearnRun. To achieve this, it learns using OTOLearner [20]. OTOLearner reads the descriptions, e.g. from XML files, and returns the model class DescriptionsFileList [21]. It then retrieves a glossary for the configured taxon group from OTO and subsequently initializes an in-memory glossary with the retrieved data. An instance of ITerminologyLearner [22] proceeds with the actual learning of terminology from the DescriptionsFileList. Finally, OTOLearner sends the collection of terminology found to OTO2, the term categorization application, for review and corrective relabeling.

ETCLearnRun
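
The sequence described above can be sketched roughly as follows; all collaborator interfaces are simplified, hypothetical stand-ins for OTOLearner's real dependencies and data model.

```java
import java.util.List;

// Hypothetical, simplified collaborators; the real CharaParser types have richer APIs.
interface DescriptionReader { List<String> readDescriptions(); }                // e.g. from XML files
interface GlossaryService { List<String> downloadGlossary(String taxonGroup); } // stands in for OTO access
interface TerminologyLearner { List<String> learn(List<String> descriptions); } // stands in for ITerminologyLearner
interface TermReviewClient { void submitForReview(List<String> terms); }        // stands in for the OTO2 upload

// Sketch of the learn flow performed by ETCLearnRun/OTOLearner as described above.
public class LearnFlowSketch {
    private final DescriptionReader reader;
    private final GlossaryService glossaryService;
    private final TerminologyLearner learner;
    private final TermReviewClient reviewClient;

    public LearnFlowSketch(DescriptionReader reader, GlossaryService glossaryService,
                           TerminologyLearner learner, TermReviewClient reviewClient) {
        this.reader = reader;
        this.glossaryService = glossaryService;
        this.learner = learner;
        this.reviewClient = reviewClient;
    }

    public void run(String taxonGroup) {
        List<String> descriptions = reader.readDescriptions();                // read descriptions, e.g. from XML
        List<String> glossary = glossaryService.downloadGlossary(taxonGroup); // glossary for the taxon group from OTO
        // ... an in-memory glossary would be initialized from the downloaded data here ...
        List<String> newTerms = learner.learn(descriptions);                  // learn terminology from the descriptions
        reviewClient.submitForReview(newTerms);                               // send terms to OTO2 for categorization
    }
}
```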

Markup

ETCLearnRun [23] is usually only run in conjunction with ETCMarkupRun [24], another IRun option. ETCMarkupRun utilizes a MarkupChain [25] that triggers the creation of markup for a configured set of description types.

ETCMarkupRun
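
A rough sketch of the chain idea, with IMarkupCreator reduced to a hypothetical two-method stand-in rather than CharaParser's actual interface:

```java
import java.util.List;

// Simplified stand-in for the real IMarkupCreator interface.
interface IMarkupCreator {
    String getDescriptionType();   // e.g. "morphology", "habitat"
    void create();                 // read, transform, and write back the descriptions of that type
}

// Sketch of a chain that triggers one markup creator per configured description type.
public class MarkupChainSketch {
    private final List<IMarkupCreator> creators; // set per run configuration

    public MarkupChainSketch(List<IMarkupCreator> creators) {
        this.creators = creators;
    }

    public void call() {
        for (IMarkupCreator creator : creators) {
            System.out.println("Creating markup for " + creator.getDescriptionType());
            creator.create();
        }
    }
}
```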

Markup Creation

The currently available IMarkupCreator implementations [26, 27, 28, 29, 30, 31, 32] all behave, on a high level, in very much the same way. They read the type of description they are supposed to transform, perform the transformation, and write the transformed description back to the source, e.g. an XML file. To illustrate this, take a look at DescriptionMarkupCreator [33], the markup creator for descriptions of type morphology. DescriptionMarkupCreator loops over a set of configured IDescriptionTransformer [34] instances to create the transformed morphological description. For example, a first transformer may create biological entities and their characters from a piece of text, while a subsequent transformer may map the biological entities found to IRIs in available ontologies.

DescriptionMarkupCreator
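
A sketch of this transformer loop, assuming simplified stand-ins for IDescriptionTransformer and the description model:

```java
import java.util.List;

// Simplified stand-in for the real IDescriptionTransformer interface.
interface IDescriptionTransformer {
    Description transform(Description description);
}

// Minimal description model; the real model is richer.
class Description {
    String text; // raw or partially transformed description text

    Description(String text) {
        this.text = text;
    }
}

// Sketch of a creator that applies the configured transformers in order.
public class DescriptionMarkupCreatorSketch {
    private final List<IDescriptionTransformer> transformers; // configured order matters

    public DescriptionMarkupCreatorSketch(List<IDescriptionTransformer> transformers) {
        this.transformers = transformers;
    }

    public Description create(Description description) {
        Description current = description;
        for (IDescriptionTransformer transformer : transformers) {
            current = transformer.transform(current); // each pass builds on the previous one
        }
        return current; // the caller writes the result back to the source, e.g. an XML file
    }
}
```

The order of the configured transformers matters, since each transformer builds on the output of the previous one.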

The following illustrates the markup creation specific to morphological descriptions. The process is much simpler for any of the other description types and is thus not discussed further.

To dive deeper into morphological descriptions, let's take a look at the IDescriptionTransformer [35] that is MarkupDescriptionTreatmentTransformer [36]. This transformer retrieves the glossary for the specific taxon group from OTO and additionally retrieves the reviewed term categorizations from the learn step discussed previously. It initializes an in-memory glossary from these two data sources and learns using an instance of DatabaseInputNoLearner [37]. As the name suggests, this does not actually learn terminology but rather initializes a required database. Finally, for all descriptions to be transformed, it extracts the description markup using a DescriptionExtractorRun [38].

MarkupDescriptionTreatmentTransformer
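
The sequence of steps can be sketched as follows; every method body is a placeholder and the names are illustrative only, not the transformer's actual API.

```java
import java.util.List;

// Hypothetical sketch of the steps described for MarkupDescriptionTreatmentTransformer.
public class TreatmentTransformerSketch {

    public void transform(List<String> descriptions, String taxonGroup) {
        Object glossary = downloadGlossary(taxonGroup);    // glossary for the taxon group, from OTO
        Object reviewedTerms = downloadCategorizations();  // reviewed term categorizations from the learn step
        initInMemoryGlossary(glossary, reviewedTerms);     // merge both data sources into one in-memory glossary
        prepareDatabase();                                 // "no-learn" step: only initializes a required database
        for (String description : descriptions) {
            extractMarkup(description);                    // delegated to a DescriptionExtractorRun per description
        }
    }

    // Placeholder stubs standing in for the real collaborators.
    private Object downloadGlossary(String taxonGroup) { return new Object(); }
    private Object downloadCategorizations() { return new Object(); }
    private void initInMemoryGlossary(Object glossary, Object reviewedTerms) { }
    private void prepareDatabase() { }
    private void extractMarkup(String description) { }
}
```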

DescriptionExtractorRun [39] loops over all sentences and chunks each of them, resulting in a ChunkCollector [40] per sentence. Once complete, it extracts the marked-up description using all the ChunkCollectors.

DescriptionExtractorRun
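
A minimal sketch of this loop, with ChunkCollector reduced to a plain list of chunk strings and the actual chunking and extraction left as placeholders:

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for the real ChunkCollector class.
class ChunkCollector {
    final List<String> chunks = new ArrayList<>();
}

// Sketch of a run that chunks every sentence and then extracts the description
// from all resulting ChunkCollectors.
public class DescriptionExtractorRunSketch {

    public String extract(List<String> sentences) {
        List<ChunkCollector> chunkCollectors = new ArrayList<>();
        for (String sentence : sentences) {
            chunkCollectors.add(chunkSentence(sentence)); // one ChunkCollector per sentence
        }
        return extractDescription(chunkCollectors);       // build the marked-up description
    }

    private ChunkCollector chunkSentence(String sentence) {
        ChunkCollector collector = new ChunkCollector();
        // a SentenceChunkerRun would normalize, tokenize, tag, parse, and chunk here
        collector.chunks.add(sentence);
        return collector;
    }

    private String extractDescription(List<ChunkCollector> chunkCollectors) {
        return "<description/>"; // placeholder for the real markup extraction
    }
}
```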

To chunk a sentence, in detail, SentenceChunkerRun [41] normalizes, tokenizes, tags, and parses the sentence. In the last step, a ChunkerChain [42] chunks the sentence, resulting in a ChunkCollector [43].

SentenceChunkerRun
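
Sketched as a pipeline, with every step reduced to a trivial placeholder for the real normalizer, tokenizer, tagger, parser, and chunker chain:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the per-sentence pipeline performed by SentenceChunkerRun:
// normalize -> tokenize -> tag -> parse -> chunk.
public class SentenceChunkerRunSketch {

    public List<String> chunk(String sentence) {
        String normalized = normalize(sentence);
        List<String> tokens = tokenize(normalized);
        List<String> tagged = tag(tokens);
        Object parseTree = parse(tagged);
        return runChunkerChain(parseTree);   // a ChunkerChain produces the chunks for the ChunkCollector
    }

    // Trivial placeholders; the real implementations are far more involved.
    private String normalize(String sentence) { return sentence.trim().toLowerCase(); }
    private List<String> tokenize(String sentence) { return Arrays.asList(sentence.split("\\s+")); }
    private List<String> tag(List<String> tokens) { return tokens; }          // POS tagging placeholder
    private Object parse(List<String> taggedTokens) { return taggedTokens; }  // parse tree placeholder
    private List<String> runChunkerChain(Object parseTree) { return Arrays.asList(); }
}
```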

To extract the marked-up description from all the ChunkCollectors [44] of a single description, in detail, the SomeDescriptionExtractor [45] first creates a ProcessingContext [46]. It iterates over all ChunkCollectors, each of them corresponding to one of the sentences in the description. It initializes the ProcessingContext for the new sentence by resetting it and setting the ChunkCollector. It then gets and loops over all the chunks of the sentence, obtains the IChunkProcessor [47] from the ChunkProcessorProvider [48] adequate for the ChunkType [49] at hand, and processes the chunk, resulting in a set of Elements. This concludes the markup process of morphological descriptions on a high level.

SomeDescriptionExtractor
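
The per-chunk dispatch can be sketched as a lookup from ChunkType to IChunkProcessor. All types below, including the enum values, are simplified stand-ins for the CharaParser classes named above.

```java
import java.util.ArrayList;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

// Placeholder chunk types; the real ChunkType enum has different, far more numerous values.
enum ChunkType { ORGAN, CHARACTER, UNASSIGNED }

// Minimal chunk model.
class Chunk {
    ChunkType type;
    String text;
}

// Simplified stand-in for the real IChunkProcessor interface.
interface IChunkProcessor {
    List<String> process(Chunk chunk, ProcessingContext context); // returns markup elements
}

// Minimal processing context; reset for every sentence/ChunkCollector.
class ProcessingContext {
    List<Chunk> currentSentenceChunks;
}

// Sketch of the extraction loop: iterate ChunkCollectors, look up a processor per chunk type, collect elements.
public class DescriptionExtractorSketch {
    // stands in for the ChunkProcessorProvider; registration of processors is omitted in this sketch
    private final Map<ChunkType, IChunkProcessor> processorProvider = new EnumMap<>(ChunkType.class);

    public List<String> extract(List<List<Chunk>> chunkCollectors) {
        ProcessingContext context = new ProcessingContext();
        List<String> elements = new ArrayList<>();
        for (List<Chunk> chunkCollector : chunkCollectors) {   // one ChunkCollector per sentence
            context.currentSentenceChunks = chunkCollector;    // (re)initialize the context for the sentence
            for (Chunk chunk : chunkCollector) {
                IChunkProcessor processor = processorProvider.get(chunk.type);
                if (processor != null) {
                    elements.addAll(processor.process(chunk, context));
                }
            }
        }
        return elements;
    }
}
```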