nevenjovanovic edited this page Jun 26, 2020 · 3 revisions

Croatiae auctores Latini (CroALa) – exploration and documentation

Notes and explanations for various sub-projects.

Linguistic analysis of a subset of CroALa texts

Petar Soldo as a LiLa Erasmus intern at the Università Cattolica del Sacro Cuore, CIRCSE, Milan, Italy, Summer semester 2019/2020.

The subset as an XQuery variable:

declare variable $docs := ("aa-vv-supetarski.xml", "sisgor-g-prosopopeya.xml", "modr-n-navic.xml", 
"marulus-m-carmina008.xml", "sisgor-g-odae.xml", "bunic-j-de-r.xml", "tubero-comm-rhac.xml", 
"andreis-f-epist-nadasd.xml", "benesa-d_epigr03_croala5095251.croala-lat1.xml", 
"gradic-s-oratio.xml", "boskovic-r-ecl.xml", "kunic-r-hymnus-cererem.xml", "milasin-f-viator.xml");
  1. Define a subset of CroALa files and copy it to another directory: create the directory first, then run the BaseX XQuery script create-subset-from-selected-files.xq.
  2. Alternatively, clone the croatiae-auctores-latini-textus repository, which already contains the subset.
  3. Create a database from the subset: createCroALaDBfromsubset.xq
  4. Create a list of words in the subset: wordlist-from-subset-db.xq
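
The database-creation and wordlist steps can be sketched as follows. Both the directory path "subset/" and the wordlist query are assumptions for illustration; the actual scripts createCroALaDBfromsubset.xq and wordlist-from-subset-db.xq may differ.

```xquery
(: 1. create a BaseX database from the subset directory
      ("subset/" is a placeholder path) :)
db:create("croalatextussubset", "subset/")

(: 2. run as a separate query: a hypothetical wordlist of
      distinct word forms found in the text of the subset :)
distinct-values(
  for $t in db:open("croalatextussubset")//*:text//text()
  return tokenize(lower-case($t), '\W+')[. != '']
)
```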

Tokenize words and punctuation in the original text (not the metadata) of the documents

  1. Inside the TEI/text node of each document, tokenize all text nodes, wrapping words in a w element and punctuation in a pc element
  2. Skip all elements that carry the attribute @ana="editorial"
  3. Replace the original TEI/text node with the updated node
  4. Export the files into the subset-tokenized directory

Tasks 1-3 are performed by the XQuery script subset-tokenize-w-pc.xq; task 4 is done by the script subset-export-files.xq.

The algorithm outlined above uses a recursive function to distinguish between text() nodes and others:

declare function local:copy-nodes-filter-text($element) {
  (: leave editorial elements and g elements untouched :)
  if ($element[@ana="editorial" or name()="g"]) then $element
  else element { node-name($element) } {
    $element/@*,
    for $child in $element/node()
    return
      if (not($child/self::text()))
      (: recurse into element children :)
      then local:copy-nodes-filter-text($child)
      (: split text nodes on whitespace, then tokenize each token :)
      else for $c in tokenize($child, "\s+") return local:tokenize-words-pc($c)
  }
};

The actual tokenization is done with the following function:

declare function local:tokenize-words-pc($token){
  (: split the token into word (fn:match) and punctuation (fn:non-match) parts :)
  for $part in analyze-string($token, '\w+')/*
  return
    if ($part/name()="fn:match") then element w { $part/string() }
    else element pc { $part/string() }
};

The analyze-string XQuery function does the essential work here: it splits a string into fn:match and fn:non-match parts according to a regular expression.
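
For example, applied to a token with trailing punctuation, analyze-string produces alternating match and non-match elements:

```xquery
analyze-string("virumque,", '\w+')
(: returns:
   <fn:analyze-string-result xmlns:fn="http://www.w3.org/2005/xpath-functions">
     <fn:match>virumque</fn:match>
     <fn:non-match>,</fn:non-match>
   </fn:analyze-string-result>
   so local:tokenize-words-pc("virumque,") yields
   <w>virumque</w><pc>,</pc> :)
```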

Transform encoding for tokenization: the 'supplied' tag

The problem: the supplied tag is used on several levels, to mark either a whole word supplied by the editors or only a part of a word (its beginning, middle, or end). When just a part of a word is marked as supplied, tokenization will split that word into separate parts.

The solution adopted for this project is to add a preparatory step which removes the supplied tag from the subset documents.

At the same time, we also used the @scope attribute to distinguish types of supplied text (with values "verbum" for the whole word, "incipit" for the beginning, "medium" for the middle, and "finis" for the end).
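
A hypothetical example of this encoding (the verse line is invented for illustration): here the end of the word "virtus" is supplied by the editor, so the @scope value is "finis":

```xml
<!-- hypothetical line; "tus" supplied at the end of "virtus" -->
<l>vir<supplied scope="finis">tus</supplied> omnia vincit</l>
```

Without the preparatory step, the tokenizer would wrap "vir" and "tus" as two separate w elements.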

The additional encoding is described in the TEI header:

<encodingDesc>
    <tagsDecl resp="#NJ">
        <namespace name="#benesa-d_epigr03_croala5095251.croala-lat1">
            <tagUsage gi="supplied">With attribute @scope=verbum: a whole word is supplied.
            With attribute @scope=incipit: the beginning of the word is supplied.
            With attribute @scope=medium: letters in the middle of the word are supplied.
            With attribute @scope=finis: the end of the word is supplied.
            This description is important for word tokenization.</tagUsage>
        </namespace>
    </tagsDecl>
</encodingDesc>

To remove the supplied tag, a new function is added to the subset-tokenize-w-pc.xq XQuery script (the function is modeled on the local:copy-nodes-filter-text described above):

declare function local:copy-nodes-filter-supplied($element) {
  (: replace a supplied element with its text content, dropping the tag :)
  if ($element[name()="supplied"]) then $element/text()
  else element { node-name($element) } {
    $element/@*,
    for $child in $element/node()
    return
      if (not($child/self::text()))
      then local:copy-nodes-filter-supplied($child)
      (: keep text nodes as they are, split on whitespace :)
      else for $c in tokenize($child, "\s+") return $c
  }
};
 

The final XQuery now has two steps:

for $xml_nodeset in db:open("croalatextussubset")//*:text
return replace node $xml_nodeset with local:copy-nodes-filter-text(local:copy-nodes-filter-supplied($xml_nodeset))
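
The export (task 4) can be sketched with BaseX's file module; the target path and the use of db:path are assumptions for illustration, and the actual subset-export-files.xq script may differ:

```xquery
(: write each document of the database into the subset-tokenized
   directory, keeping its original file name :)
for $doc in db:open("croalatextussubset")
return file:write("subset-tokenized/" || db:path($doc), $doc)
```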