=============================================================
   Osalausestaja: Clause Segmenter for Estonian
=============================================================

  Clause Segmenter is a program that splits long and complex natural 
 language sentences into smaller segments (clauses). For example, the 
 sentence "Mees, keda seal kohtasime, oli tuttav ja teretas meid." 
 will be split into the following clauses:

     "[Mees, [keda seal kohtasime,] oli tuttav ja] [teretas meid.]"
     (in the example, clauses are surrounded by brackets)

   The algorithm mainly relies on punctuation, conjunction words, and
  finite verb forms for identifying the clause boundaries.
  For linguistic details/motivations behind the algorithm, see (Kaalep, 
  Muischnek 2012).

=========================
   Requirements
=========================
 For building the program (JAR file):
  ** Java JDK (at least version 1.8.x is expected);
  ** Apache Ant (at least version 1.8.2);
  
 For using the program:
  ** A sentence segmenter;
  ** A word tokenizer;
  ** An Estonian morphological analyzer; possible options:
     -- Filosoft Vabamorf: https://github.com/Filosoft/vabamorf
     -- PyVabamorf:        https://github.com/estnltk/pyvabamorf
     -- T3MESTA (a commercial morphological analyzer);
  ** An Estonian morphological disambiguator:
     -- Vabamorf's disambiguator: 
        https://github.com/Filosoft/vabamorf
     NB! The clause segmenter also works on morphologically ambiguous 
        input, but the quality of the analysis is expected to be lower 
        than in the case of morphologically disambiguated text.

=========================
   Building the program
=========================
   The most straightforward way to compile the program is to use
  Apache Ant and the build script ("build.xml" in the root directory).
  
   Before building, the correct path to the JDK must be set in the file 
  "build.properties" (variable "java.home.location"). Then, building
  and deploying can be invoked with the command:

      ant deploy

  (executed in the same directory where "build.xml" is located).
   This compiles the Java source code, builds the JAR file (Osalau.jar), 
  and copies the JAR file, along with the required files, into the folder 
  "test".
  
=========================
   Using the program
=========================

    Basic usage
   ---------------
   Before the clause segmenter can be applied to a text, a number of 
  text preprocessing steps must be performed: the text must be split into 
  sentences and tokens (words), and the words must be morphologically 
  analysed (and disambiguated).
   The core of these functionalities is provided by the EstNLTK toolkit, 
  so the easiest way to use the clause segmenter is within this toolkit 
  (see https://github.com/estnltk/estnltk for more details).

   The clause segmenter expects its input to be in the same format as the 
  output of Vabamorf's command 'etana analyze': JSON-structured text in 
  UTF-8 encoding. Note that the word root analyses are expected to be 
  'clean', without any phonetic markup symbols (which 'etana' can 
  optionally add with the flag '-phonetic').
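
   As a rough, non-authoritative sketch (the field layout here is inferred from 
  the output example shown further below; see "test/example_input.json" for the 
  real format), one analysed sentence looks approximately as follows, with the 
  morphological analyses elided:

      {"words": [ {"text": "Mees,",      "analysis": [ ... ]},
                  {"text": "keda",       "analysis": [ ... ]},
                  ...
                  {"text": "meid.",      "analysis": [ ... ]} ]}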

  An example of JSON input can be found in the file "test/example_input.json".
  In the "test" folder, the following command invokes the clause segmenter on
  the input file "example_input.json" and outputs the results to standard 
  output:
  
     java -jar Osalau.jar -in file example_input.json -pretty_print

  (flag "-pretty_print" switches on the pretty printing mode, otherwise, 
  all of the output JSON is on single line);

  Alternatively, the output can be directed to a file by specifying:

     java -jar Osalau.jar -in file example_input.json -pretty_print -out file my_output.json

  Flag "-pyvabamorf" evokes the program in a special standard input/output 
  processing mode, where the program reads a line from the standard input,
  analyzes the line, and outputs the results (in a single line) to the standard 
  output.

     java -jar Osalau.jar -pyvabamorf
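
   For illustration, below is a minimal sketch of driving this mode from another 
  Java program. The class name, the location of "Osalau.jar", and the placeholder 
  input line are assumptions; in reality each input line must be one complete 
  JSON-encoded, morphologically analysed sentence.

      import java.io.*;
      import java.nio.charset.StandardCharsets;

      public class OsalauPipeExample {
          public static void main(String[] args) throws Exception {
              // start the segmenter in line-by-line mode (assumes Osalau.jar
              // is in the current working directory)
              Process proc = new ProcessBuilder(
                      "java", "-jar", "Osalau.jar", "-pyvabamorf").start();
              BufferedWriter toSeg = new BufferedWriter(
                      new OutputStreamWriter(proc.getOutputStream(), StandardCharsets.UTF_8));
              BufferedReader fromSeg = new BufferedReader(
                      new InputStreamReader(proc.getInputStream(), StandardCharsets.UTF_8));

              // placeholder only: a real line must carry a full analysed sentence
              String jsonLine = "{\"words\": []}";
              toSeg.write(jsonLine);
              toSeg.newLine();
              toSeg.flush();

              // the segmenter answers with one line of JSON (with clauseAnnotation marks)
              System.out.println(fromSeg.readLine());

              toSeg.close();
              proc.waitFor();
          }
      }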

   More details about Vabamorf and its JSON format:   
        https://github.com/Filosoft/vabamorf 


    The "insensitive to missing commas" mode
   -------------------------------------------
    The clause segmenter can also be executed in a mode in which the program 
   attempts to be less sensitive to missing commas while detecting clause boundaries. 
   
    The flag "-ins_comma_mis" can be used to switch this mode on:
    
     java -jar Osalau.jar -in file example_input_missing_commas.json -pretty_print -ins_comma_mis
   
    (the file "example_input_missing_commas.json" can be found in the folder "test");
    
     Note that this mode is experimental: compared to the default mode, it can 
   introduce additional incorrect clause boundaries, although it also improves 
   clause boundary detection in texts with many missing commas.
   
============================
   Interpreting the output
============================

  The clause segmenter marks clause boundaries: boundaries between regular 
 clauses, and start and end positions of embedded clauses. 
 
  In the JSON input/output format, a clause boundary is indicated by adding the object 
 'clauseAnnotation' to the token (at the same level as the objects 'text' and 
 'analysis'). The 'clauseAnnotation' (which is a list of strings) can contain 
 three types of boundary markings:
    KINDEL_PIIR -- indicates that there is a clause boundary AFTER the current 
                   token: one clause ends and another starts;
    KIILU_ALGUS -- marks the beginning of a new embedded clause BEFORE the 
                   current token;
    KIILU_LOPP  -- marks the end of an embedded clause AFTER the current token;

  Example:
    The sentence
       "Mees, keda seal kohtasime, oli tuttav ja teretas meid."

    will obtain the following clause annotations:
        {'words': [ {'analysis': [ ... ],
                      'text': 'Mees,'},
                     {'analysis': [ ... ],
                      'clauseAnnotation': ['KIILU_ALGUS'],
                      'text': 'keda'},
                     {'analysis': [ ... ],
                      'text': 'seal'},
                     {'analysis': [ ... ],
                      'clauseAnnotation': ['KIILU_LOPP'],
                      'text': 'kohtasime,'},
                     {'analysis': [ ... ],
                      'text': 'oli'},
                     {'analysis': [ ... ],
                      'text': 'tuttav'},
                     {'analysis': [ ... ],
                      'clauseAnnotation': ['KINDEL_PIIR'],
                      'text': 'ja'},
                     {'analysis': [ ... ],
                      'text': 'teretas'},
                     {'analysis': [ ... ],
                      'text': 'meid.'} ]}

    which should be interpreted as:
          "keda" (KIILU_ALGUS) -- an embedded clause begins before "keda";
          "kohtasime," (KIILU_LOPP) -- the embedded clause ends after "kohtasime,";
          "ja" (KINDEL_PIIR)   -- one clause ends after "ja" and another begins;
          
    so, the corresponding clause structure should look like:
       "[Mees, [keda seal kohtasime,] oli tuttav ja] [teretas meid.]"
       (clauses are surrounded by brackets)
       
  Note that embedded clauses can contain other clauses and other embedded 
 clauses, and so the whole clause structure has a recursive nature.
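
   As a small illustration (this class is not part of Osalau.jar, and it simplifies 
  by assuming at most one annotation mark per token), the bracketed clause structure 
  of the example sentence can be rebuilt from the tokens and their 'clauseAnnotation' 
  marks as follows:

      public class ClauseBrackets {
          public static void main(String[] args) {
              String[] tokens = {"Mees,", "keda", "seal", "kohtasime,", "oli",
                                 "tuttav", "ja", "teretas", "meid."};
              // the clauseAnnotation mark of each token ("" = no mark)
              String[] marks  = {"", "KIILU_ALGUS", "", "KIILU_LOPP", "",
                                 "", "KINDEL_PIIR", "", ""};

              StringBuilder out = new StringBuilder("[");
              for (int i = 0; i < tokens.length; i++) {
                  if (i > 0 && out.charAt(out.length() - 1) != '[') out.append(' ');
                  if (marks[i].equals("KIILU_ALGUS")) out.append('[');   // embedded clause starts BEFORE token
                  out.append(tokens[i]);
                  if (marks[i].equals("KIILU_LOPP"))  out.append(']');   // embedded clause ends AFTER token
                  if (marks[i].equals("KINDEL_PIIR")) out.append("] ["); // boundary AFTER token: close, reopen
              }
              out.append(']');
              // prints: [Mees, [keda seal kohtasime,] oli tuttav ja] [teretas meid.]
              System.out.println(out);
          }
      }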

=========================
   References
=========================

  *) Kaalep, Heiki-Jaan; Muischnek, Kadri (2012). Osalausete tuvastamine 
     eestikeelses tekstis kui iseseisev ülesanne. In: Helle Metslang, Margit 
     Langemets, Maria-Maren Sepper (Eds.). Eesti Rakenduslingvistika Ühingu 
     aastaraamat (55-68). Tallinn: Eesti Rakenduslingvistika Ühing.

  *) Kaalep, Heiki-Jaan; Muischnek, Kadri (2012). Robust clause boundary 
     identification for corpus annotation. In: Nicoletta Calzolari, Khalid Choukri, 
     Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani et al. 
     (Eds.). Proceedings of the Eighth International Conference on Language 
     Resources and Evaluation (LREC'12) (1632-1636). Istanbul, Turkey: ELRA.