JulienBez/ArabicToTXM

ArabicToTXM

This project was created during the CERES Hackathon event with the participation of Rimane Karam.

The goal was to create a script that converts an Arabic corpus (.doc, .docx or .txt format) into a TXM-compatible file (.xml format). We use the work presented in Camelira: An Arabic Multi-Dialect Morphological Disambiguator (Ossama Obeid, Go Inoue, Nizar Habash, 2022) to assign multiple POS tags to each word of the corpus.

Installation

To install ArabicToTXM, you need Python 3.x and pip. First install the system dependencies required by Camel-Tools, the package used to apply the POS tags; refer to the official Camel-Tools documentation for more information. On Ubuntu, the command to install those dependencies is:

sudo apt-get install cmake libboost-all-dev

If you want to convert .doc files, you must also have LibreOffice installed:

sudo apt-get install libreoffice

Once all the dependencies are installed, clone this repository to your computer. Open a terminal and go to the ArabicToTXM folder (the one containing main.py), then install the required packages with the following command:

pip install -r requirements.txt

Next, run this command to install the Camel data:

camel_data -i light

How to use

The program provides two commands. The first one extracts the contents of Word files (.doc and .docx) and writes them to text files (.txt), one for each Word document processed. The command is as follows:

python main.py --docxToText

The Word files must be placed in data/doc/. The text files produced by this command are stored in data/text/. If your corpus is already in text format, just place the text files in data/text/ and skip the first command.
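
To give a rough idea of what this extraction step involves for .docx files, here is a minimal sketch that reads paragraph text using only the Python standard library. It is not the code in main.py, which may rely on a different library, and the helper name docx_to_text is hypothetical. It relies on the fact that a .docx file is a ZIP archive whose main body is stored in word/document.xml. It does not handle .doc files, which is why LibreOffice is needed for those.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used inside word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(docx_path):
    """Return the text of a .docx file, one paragraph per line (hypothetical helper)."""
    with zipfile.ZipFile(docx_path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # The text of one paragraph is split across several <w:t> runs.
        text = "".join(node.text or "" for node in para.iter(W_NS + "t"))
        paragraphs.append(text)
    return "\n".join(paragraphs)
```

The returned string could then be written to a .txt file in data/text/ with UTF-8 encoding, so that Arabic characters are preserved.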

The second command tokenizes each text file and applies POS tags to each token. The result is a .xml file containing one word per line with its POS tags. The command is as follows:

python main.py --POStag
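
To make the "one word per line with its POS tags" layout more concrete, here is a minimal sketch that serializes tagged tokens as XML. The element and attribute names (text, w, pos, lex) and the helper tokens_to_xml are assumptions made for illustration; the exact layout written by main.py may differ. In the real script the tags come from Camel-Tools, whereas here they are written by hand.

```python
from xml.sax.saxutils import escape, quoteattr


def tokens_to_xml(tagged_tokens, text_id="sample"):
    """Serialize (token, tags) pairs as XML, one word per line (assumed layout)."""
    lines = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        "<text id=%s>" % quoteattr(text_id),
    ]
    for token, tags in tagged_tokens:
        # Each tag becomes an attribute on the word element, e.g. pos="noun".
        attrs = "".join(
            " %s=%s" % (name, quoteattr(value)) for name, value in tags.items()
        )
        # escape() protects characters such as & and < inside the token.
        lines.append("<w%s>%s</w>" % (attrs, escape(token)))
    lines.append("</text>")
    return "\n".join(lines)


# Hand-written tags for illustration only; not real Camel-Tools output.
sample = [("كتاب", {"pos": "noun", "lex": "كِتاب"}), ("&", {"pos": "punc"})]
print(tokens_to_xml(sample))
```

Escaping token text and attribute values matters here because a corpus can contain characters that would otherwise make the XML file invalid.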

The list of applied POS tags can be found in scripts/tags_list.json. For more information about the tags you can add to the list, please refer to Camelira's online documentation and Camelira's tag list.
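
The sketch below illustrates how a tag list can drive which features are kept for each word. It assumes, without having checked the actual file, that scripts/tags_list.json is a JSON array of feature names; the helper select_tags and the sample analysis are hypothetical, and the real script may work differently.

```python
import json


def select_tags(analysis, wanted):
    """Keep only the features named in `wanted` from one analysis dict."""
    # Features missing from the analysis are skipped instead of raising an error.
    return {name: analysis[name] for name in wanted if name in analysis}


# Assumed file content: a JSON array of feature names.
wanted = json.loads('["pos", "lex", "gen"]')

# Hand-written analysis for illustration; not real Camel-Tools output.
analysis = {"pos": "noun", "lex": "كِتاب", "gen": "m", "num": "s"}
print(select_tags(analysis, wanted))
```

Under this assumption, adding a feature name to the list would be enough to make it appear in the output, provided the analyzer produces that feature.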

About

Script used to convert an Arabic corpus to a TXM-compatible file. Accepts .docx and .txt files.
