CamelParser

MIT License

Introduction

CamelParser is an open-source, Python-based Arabic dependency parser targeting two popular Arabic dependency formalisms: the Columbia Arabic Treebank (CATiB) and Universal Dependencies (UD).

The CamelParser pipeline processes raw text and produces tokenization, part-of-speech tags, and rich morphological features. For disambiguation, users can choose between the BERT unfactored disambiguator and a lighter Maximum Likelihood Estimation (MLE) disambiguator, both of which are included in CAMeL Tools. For dependency parsing, we use the SuPar Biaffine Dependency Parser.

Installation

  1. Clone this repo
  2. Install the required packages:
pip install -r requirements.txt
  3. Download dependency parsing models:
python download_models.py

Currently, two Arabic script models, CATiB and UD, will be downloaded from the CAMeL Lab's parser models collection on Hugging Face. More models will be added soon!

Examples

Below are examples using the different inputs that CamelParser accepts. We pass each example as a string using -s. However, when passing multiple sentences it is better to use -i along with the path to the file containing the sentences.
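For the multi-sentence case, the following is a minimal sketch of writing sentences to a file and building the matching -i invocation from Python. The helper names (write_sentences, build_command, run_parser) are hypothetical, and the one-sentence-per-line file layout is an assumption that this README does not confirm; only the -f and -i flags come from the text above.

```python
import subprocess
import sys
from pathlib import Path


def write_sentences(sentences, path):
    """Write the sentences to a UTF-8 file, one per line (assumed layout)."""
    path = Path(path)
    path.write_text("\n".join(sentences) + "\n", encoding="utf-8")
    return path


def build_command(input_path, file_type="text"):
    """Build the argument list for text_to_conll_cli.py using -i instead of -s."""
    return [
        sys.executable,
        "text_to_conll_cli.py",
        "-f", file_type,
        "-i", str(input_path),
    ]


def run_parser(input_path, file_type="text"):
    """Run the CLI; expects the current directory to be the repository root."""
    return subprocess.run(build_command(input_path, file_type), check=True)
```

Calling run_parser(write_sentences(my_sentences, "sentences.txt")) would then parse the whole file in one invocation, provided the models have already been downloaded.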

Passing text

python text_to_conll_cli.py -f text -s "جامعة نيويورك أبو ظبي تنشر أول أطلس لكوكب المريخ باللغة العربية."

The verbose version of the above example (default values are shown)

python text_to_conll_cli.py -f text -b r13 -d bert -m catib -t catib6 -s "جامعة نيويورك أبو ظبي تنشر أول أطلس لكوكب المريخ باللغة العربية."

Passing preprocessed text (cleaned and whitespace tokenized)

python text_to_conll_cli.py -f preprocessed_text -s "جامعة نيويورك أبو ظبي تنشر أول أطلس لكوكب المريخ باللغة العربية ."

Note that the difference between the -f text and -f preprocessed_text input settings is that for text we use different utilities from CAMeL Tools to normalize Unicode, dediacritize, clean the text using arclean, and perform whitespace tokenization; preprocessed_text input is expected to be already cleaned and whitespace tokenized.
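To give a feel for what this preprocessing does, here is a rough standard-library approximation of three of the steps (Unicode normalization, dediacritization, and whitespace tokenization). It is not the CAMeL Tools implementation: the function name rough_preprocess, the diacritic range, and the tokenization rule are assumptions made for illustration, and the arclean cleaning step is omitted.

```python
import re
import unicodedata

# Assumed diacritic set: fathatan through sukun, plus superscript alef.
_DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")
# A token is either a run of word characters or a single punctuation mark.
_TOKEN = re.compile(r"\w+|[^\w\s]")


def rough_preprocess(text):
    """Approximate the text-mode preprocessing with the standard library only."""
    text = unicodedata.normalize("NFKC", text)  # Unicode normalization
    text = _DIACRITICS.sub("", text)            # strip Arabic diacritics
    return " ".join(_TOKEN.findall(text))       # separate punctuation with spaces
```

Applied to the raw example sentence, this would detach the final period from the last word, which is the visible difference between the text and preprocessed_text examples above.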

tokenized is used when 1) the text has already been tokenized, and 2) only dependency relations are needed; the POS tags and features will not be generated.

python text_to_conll_cli.py -f tokenized -s "جامعة نيويورك أبو ظبي تنشر أول أطلس ل+ كوكب المريخ ب+ اللغة العربية ."

tokenized_tagged is used when the user has the tokens and POS tags. They should be passed as tuples.

python text_to_conll_cli.py -f tokenized_tagged -s "(جامعة, NOM) (نيويورك, PROP) (أبو, PROP) (ظبي, PROP) (تنشر, VRB) (أول, NOM) (أطلس, NOM) (ل+, PRT) (كوكب, NOM) (المريخ, PROP) (ب+, PRT) (اللغة, NOM) (العربية, NOM) (., PNX)"
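If the tokens and tags already live in a Python list, a small helper can produce the tuple string shown above. This is a sketch: format_tagged is a hypothetical name, and the "(token, TAG)" layout is copied from the example command rather than from a format specification.

```python
def format_tagged(pairs):
    """Render (token, tag) pairs as "(token, TAG) (token, TAG) ..."."""
    return " ".join(f"({token}, {tag})" for token, tag in pairs)


# The last four tokens of the example sentence, with their tags.
pairs = [("ب+", "PRT"), ("اللغة", "NOM"), ("العربية", "NOM"), (".", "PNX")]
tagged_string = format_tagged(pairs)
```

The resulting tagged_string can be passed to -s together with -f tokenized_tagged.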

Extending the code

You can also use different parts of the code to create your own pipeline. handle_multiple_texts.py is an example of that: it parses a directory of text files and saves the resulting CoNLL-X files to a given output directory.
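As a starting point for a pipeline of your own, the directory-to-directory mapping could be sketched as follows. This is not the contents of handle_multiple_texts.py: the plan_outputs name, the *.txt input pattern, and the .conllx output suffix are all assumptions chosen for illustration.

```python
from pathlib import Path


def plan_outputs(input_dir, output_dir, suffix=".conllx"):
    """Pair every .txt file in input_dir with an output path in output_dir."""
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    return [
        (path, output_dir / (path.stem + suffix))
        for path in sorted(input_dir.glob("*.txt"))
    ]
```

Each (input, output) pair can then be handed to whichever parsing step your pipeline uses, writing the parsed result to the output path.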

Using another morphology database

Currently, CamelParser uses the default morphology database of CAMeL Tools, morphology-db-msa-r13.

For our paper, we used the calima-msa-s31 database. To use this database, follow these steps (note that you need an account with the LDC):

  1. Install camel_tools v1.5.2 or later (you can check this using camel_data -v)
  2. Download the camel data for the BERT unfactored (MSA) model, as well as the morphology database:
camel_data -i morphology-db-msa-s31
camel_data -i disambig-bert-unfactored-msa
  3. Download LDC2010L01 from the LDC downloads.
  4. DO NOT EXTRACT LDC2010L01.tgz! We'll use the following command from CAMeL Tools to install the database:
camel_data -p morphology-db-msa-s31 /path/to/LDC2010L01.tgz
  5. When running the main script, use -b and pass calima-msa-s31.

Citation

If you find the CamelParser useful in your research, please cite

@inproceedings{Elshabrawy:2023:camelparser,
    title = "{CamelParser2.0: A State-of-the-Art Dependency Parser for Arabic}",
    author = {Ahmed Elshabrawy and
              Muhammed AbuOdeh and
              Go Inoue and
              Nizar Habash},
    booktitle = {Proceedings of The First Arabic Natural Language Processing Conference (ArabicNLP 2023)},
    year = "2023"
}
