Project Status: Active. The project has reached a stable, usable state and is being actively developed.

FoliAutils

(c) CLST/TiCC/CLiPS 2024 https://github.com/LanguageMachines/foliautils

Centre for Language and Speech Technology, Radboud University Nijmegen
Tilburg centre for Cognition and Communication, Tilburg University and
Centre for Dutch Language and Speech, University of Antwerp

This file is part of foliautils. foliautils provides a series of programs that make FoLiA processing easier.

This includes:

  • FoLiA-2text : convert FoLiA documents into plain text.

  • FoLiA-txt : convert plain text documents into FoLiA.

  • FoLiA-page : convert PAGE documents into FoLiA.

  • FoLiA-abby : convert ABBYY FineReader documents into FoLiA.

  • FoLiA-hocr : convert hOCR documents into FoLiA.

  • FoLiA-alto : convert ALTO DIDL files into series of FoLiA documents.

  • FoLiA-langcat : assign language tags to the words in a FoLiA document.

  • FoLiA-idf : count words in a series of FoLiA documents and generate a .tsv file describing the IDF.

  • FoLiA-stats : gather n-gram statistics from series of FoLiA files.

  • FoLiA-collect : collect n-gram statistics of .tsv files produced by FoLiA-stats.

  • FoLiA-clean : cleanup FoLiA documents, removing unused declarations etc.

  • FoLiA-pm : convert Political Mashup documents into FoLiA.

  • FoLiA-correct : correct FoLiA files using correction candidates generated by TICCL-rank (from the ticcltools package).

foliautils is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 3 of the License, or (at your option) any later version.

Comments and bug reports are welcome at our issue tracker at https://github.com/LanguageMachines/foliautils/issues or by mailing lamasoftware (at) science.ru.nl.


This software has been tested on:

  • Intel platforms running several versions of Linux, including Ubuntu, Debian, Arch Linux and Fedora (both 32-bit and 64-bit)
  • Mac platform running OS X 10.10

Contents of this distribution:

  • Sources
  • Licensing information ( COPYING )
  • Build system based on GNU Autotools
  • Dockerfile

Dependencies: to build foliautils from source successfully, you need the following:

  • ticcutils
  • libfolia
  • ucto
  • libicu-dev
  • libxml2-dev
  • libexttextcat-dev OR libtextcat-dev (OS dependent)
  • A sane C++ build environment with autoconf, automake, autoconf-archive, make, gcc or clang, libtool, pkg-config

To install ticcutils, libfolia and ucto, first check whether your distribution's package manager has an up-to-date package. If not, you can use the supplied build-deps.sh script to automatically download and install the latest stable versions of these dependencies. You can pass a target directory prefix as the first argument, and you may need to prepend sudo to ensure you can install there.

To compile and install FoLiA-utils manually from source, provided you have all the dependencies installed, do:

$ bash bootstrap.sh
$ ./configure
$ make
$ make install
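
The dependency and build steps above can be combined when everything should go into one non-system prefix. The following is only a sketch under stated assumptions: the prefix $HOME/.local is a hypothetical choice, and the PKG_CONFIG_PATH and LD_LIBRARY_PATH settings are general pkg-config and Linux linker conventions rather than something this README prescribes, so they may be unnecessary on your system.

```shell
#!/bin/sh
# Sketch: build the dependencies and foliautils into one user-writable prefix.
# PREFIX is an assumed location; choose any directory you can write to.
PREFIX="${PREFIX:-$HOME/.local}"

# Let pkg-config and the dynamic linker find libraries installed under PREFIX
# (assumption: the libraries end up in $PREFIX/lib).
export PKG_CONFIG_PATH="$PREFIX/lib/pkgconfig${PKG_CONFIG_PATH:+:$PKG_CONFIG_PATH}"
export LD_LIBRARY_PATH="$PREFIX/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"

# Only start the build when run from the root of the repository.
if [ -f bootstrap.sh ] && [ -f build-deps.sh ]; then
    bash build-deps.sh "$PREFIX" &&    # target prefix is the first argument
    bash bootstrap.sh &&
    ./configure --prefix="$PREFIX" &&  # standard Autotools install prefix
    make &&
    make install
else
    echo "run this from the root of the foliautils repository" >&2
fi
```

With a prefix inside your home directory no sudo should be needed; for a system-wide prefix, the install steps would have to run with sufficient privileges.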

Container Usage

A pre-made container image can be obtained from Docker Hub as follows:

docker pull proycon/foliautils

To build the Docker image yourself, make sure you are in the root of this repository and run:

docker build -t proycon/foliautils .

This builds the latest stable release. To use the latest development version from the git repository instead, run:

docker build -t proycon/foliautils --build-arg VERSION=development .

Run the container interactively as follows:

docker run -t -i proycon/foliautils

Or invoke the tool you want:

docker run proycon/foliautils FoLiA-page

Add the -v /path/to/your/data:/data parameter (before -t) to mount your data directory into the container at /data.
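
As an illustration of the volume option, here is a small wrapper function. It is a sketch, not part of foliautils: the function name foliautils_shell and the DOCKER override variable are hypothetical, introduced only to make the command easy to inspect before running it.

```shell
# Hypothetical helper: start an interactive foliautils container with a host
# directory mounted at /data. The directory defaults to the current one and
# should be an absolute path so Docker treats it as a bind mount.
# Set DOCKER=echo to print the arguments instead of starting a container.
foliautils_shell() {
    data_dir="${1:-$PWD}"
    # -v is placed before -t, as described above.
    ${DOCKER:-docker} run -v "$data_dir:/data" -t -i proycon/foliautils
}
```

Calling foliautils_shell /path/to/your/data would then open a shell in the container, with your files visible under /data.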

About

Command-line utilities for working with the Format for Linguistic Annotation (FoLiA), powered by libfolia (C++), written by Ko van der Sloot (CLST, Radboud University)
