Skip to content

ContentSide/LingX

Repository files navigation

LingX
PyPI version Contributions welcome License: MIT

LingX

A library for introducing state-of-the-art metrics on measuring linguistic complexity developed by ContentSide and CRITT at Kent State University.


LingX is:

  • A library for calculating some of the psycholinguistics complexity metrics.
  • A library for obtaining helpful metrics for translational process studies.
  • A library for different factors related to the text analysis such as accuracy, fluency and style.
  • A library with extended modules to easily integerate translational studies in CRITT TPR-DB.

How does LingX generally work?

LingX calculates different token-based and segment-based mono-bilingual complexity metrics. It internaly parses a given text into a dependency grammar graph. Using the graph and other linguistic information such as part-of-speech tagging, it can caculates different psycholinguistics, linguistic and translational process metrics. See the reference section for detailed information.

LingX uses Stanza state-of-the-arts NLP library for different language-based tasks. Stanza is a collection of accurate and efficient tools for the linguistic analysis of many human languages. Stanza brings state-of-the-art NLP models to different languages.

Quick Start

Requirements and Installation

The project is based on Stanza 1.2.1 and Python 3.6+. If you do not have Python 3.6, install it first. Then, in your favorite virtual environment, simply do:

pip install lingx

If you are running project in Jupyter Notebook or Google Colab enviroments run the following command instead:

!pip install lingx

Example Usage

Let's run a simple token-based psycholingual incomplete complexity theory (IDT) metric as a test. All you need to do is to make import related methods and codes as follows:

from lingx.utils import download_lang_models
from lingx.core.lang_model import get_nlp_object
from lingx.utils.lx import get_sentence_lx

nlp_en = get_nlp_object("en", use_critt_tokenization = False, package="partut")

input = "The reporter who the senator who John met attacked disliked the editor."

tokens_scores_list, aggregated_score = get_sentence_lx(
                                                       input,
                                                       nlp_en,
                                                       result_format="segment",
                                                       complexity_type="idt", 
                                                       aggregation_type="sum")

print(f"Tokens Scores List == {tokens_scores_list}")
print(f"Aggregated Score == {aggregated_score}")

This should print the incomplete complexity theory (IDT) metric list with related tokens and aggregated score using aggregated function sum:

Tokens Scores List == [['The', 1], ['reporter', 2], ['who', 3], ['the', 4], ['senator', 3], ['who', 4], ['John', 5], ['met', 2], ['attacked', 2], ['disliked', 2], ['the', 3], ['editor', 1], ['.', 0]]
Aggregated Score == 32

Tutorials

We provide a set of quick tutorials to get you started with the library:

The tutorials explain how the base metrics can be obtained. Let us know if anything is unclear.

CRITT Translation Process Database (TPR-DB)

The CRITT Translation Process Database (TPR-DB) is released under Creative Commons License (CC BY-NC-SA). Note that the available EN-ZH_IMBst18 database in this github belongs to CRITT TPR-DB.


Citing LingX

Please cite the paper https://doi.org/10.1007/978-3-030-98404-5_49 :

@InProceedings{10.1007/978-3-030-98404-5_49,
author="Zou, Longhui
and Carl, Michael
and Mirzapour, Mehdi
and Jacquenet, H{\'e}l{\`e}ne
and Vieira, Lucas Nunes",
editor="Kim, Jong-Hoon
and Singh, Madhusudan
and Khan, Javed
and Tiwary, Uma Shanker
and Sur, Marigankar
and Singh, Dhananjay",
title="AI-Based Syntactic Complexity Metrics and Sight Interpreting Performance",
booktitle="Intelligent Human Computer Interaction",
year="2022",
publisher="Springer International Publishing",
address="Cham",
pages="534--547",
isbn="978-3-030-98404-5"
}

For IDT-based and DLT-based complexities, please cite this paper:

@incollection{mirzapour2020,
  title={Measuring Linguistic Complexity: Introducing a New Categorial Metric},
  author={Mirzapour, Mehdi and Prost, Jean-Philippe and Retor{\'e}, Christian},
  booktitle={Logic and Algorithms in Computational Linguistics 2018 (LACompLing2018)},
  pages={95--123},
  year={2020},
  publisher={Springer}
}

Contact

Please email your questions or comments to Mehdi Mirzapour.

LingX is licensed under the following MIT License (MIT) Copyright © 2021 ContentSide and CRITT at Kent State University.