Vec2Read: Multilingual Readability Assessment

Vec2Read is an automatic multilingual readability assessment tool developed in pytorch. This is the result of a research collaboration between Ion Madrazo Azpiazu and Maria Soledad Pera. The orginal paper for this project published in Transactions of the Association for Computational Linguistics can be seen here.

Please cite this work as follows:

@article{madrazo2019,
    author = {Madrazo Azpiazu, Ion and Pera, Maria Soledad},
    title = {Multiattentive Recurrent Neural Network Architecture for Multilingual Readability Assessment},
    journal = {Transactions of the Association for Computational Linguistics},
    volume = {7},
    number = {},
    pages = {421-436},
    year = {2019},
    doi = {10.1162/tacl\_a\_00278},
    URL = {https://doi.org/10.1162/tacl_a_00278},
    eprint = {https://doi.org/10.1162/tacl_a_00278}
}

Abstract

We present a hierarchical recurrent neural network architecture for automatic multilingual readability assessment. This architecture considers raw words as main input, but internally captures text structure and informs its word attention process using other syntax- and morphology-related datapoints, known to be of great importance to readability. This is achieved by a novel multiattentive strategy that allows the neural network to focus on specific parts of a text for predicting its readability level. We conducted exhaustive evaluation using datasets targeting multiple languages and prediction task types, to compare the proposed model with several baseline strategies.

Requirements

In order to run this code you will first need to:

In addition, you will also need to install libraries described in requirements.txt file located on the root directory of this repository. You can install the by running the following command:

pip install -r requirements.txt

Training your own model

In order to train a model using Vec2Read you need to follow the steps below:

Decide on a data loader

You can find different data loaders under datasets directory. The recommeded one is SingleLanguageSyntaxnetDataLoader which loads a dataset parsed by syntaxnet. You can find an example of how to use syntaxnet to parse documents in notebooks/ParseDocumentsWithSyntaxnet.ipynb. The data loader expects each text to be in a separate file and in json format. In addition, all documents for an specific class need to be in a single directory.

# Example of a file structure for text: The drink is served hot.
# Note that each document is a list of sentences and each sentence is a list of word dictionaries.
[
[
        {
            "id": 1,
            "form": "The",
            "upostag": "DET",
            "xpostag": "DT",
            "feats": {
                "Definite": "Def",
                "fPOS": "DET++DT",
                "PronType": "Art"
            },
            "head": 2,
            "deprel": "det"
        },
        {
            "id": 2,
            "form": "drink",
            "upostag": "NOUN",
            "xpostag": "NN",
            "feats": {
                "fPOS": "NOUN++NN",
                "Number": "Sing"
            },
            "head": 4,
            "deprel": "nsubjpass"
        },
        {
            "id": 3,
            "form": "is",
            "upostag": "AUX",
            "xpostag": "VBZ",
            "feats": {
                "Mood": "Ind",
                "fPOS": "VERB++VBZ",
                "Number": "Sing",
                "Person": "3",
                "Tense": "Pres",
                "VerbForm": "Fin"
            },
            "head": 4,
            "deprel": "auxpass"
        },
        {
            "id": 4,
            "form": "served",
            "upostag": "VERB",
            "xpostag": "VBN",
            "feats": {
                "fPOS": "ADJ++JJ",
                "Degree": "Pos"
            },
            "head": 0,
            "deprel": "ROOT"
        },
        {
            "id": 5,
            "form": "hot",
            "upostag": "NOUN",
            "xpostag": "NN",
            "feats": {
                "fPOS": "NOUN++NN",
                "Number": "Sing"
            },
            "head": 4,
            "deprel": "xcomp"
        },
        {
            "id": 6,
            "form": ".",
            "upostag": "PUNCT",
            "xpostag": ".",
            "feats": {
                "fPOS": "PUNCT++."
            },
            "head": 4,
            "deprel": "punct"
        }
    ]
]

Configure training process

You will need to create an experiment configuration file. You can simply start by extending any of the files under config/ directory. Parts you will most likely need to modify are the following:

"classes": ["classname1","classname2"],
"data_folders": {"classname1":"folder/where/you/stored/all/files/for/classname1",
                  "classname2":"folder/where/you/stored/all/files/for/classname2"},
"embedding_folder":"embedding/folder",
"embedding_file":"name_of_embeddingfile.vec",
"max_number_of_files_per_category": <your number of files per category>,

Train

To run the experiment use the following command:

python main.py configs/config_name.json

License

This software has a free to use license for non commercial use. Check LICENSE file for more details.

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
agents		agents
configs		configs
datasets		datasets
experiments		experiments
graphs		graphs
logs		logs
notebooks		notebooks
pretrained_weights		pretrained_weights
utils		utils
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
__init__.py		__init__.py
architecture.png		architecture.png
docker-syntaxnet-compose.yml		docker-syntaxnet-compose.yml
downloadembeddings.sh		downloadembeddings.sh
main.py		main.py
requirements.txt		requirements.txt
run.sh		run.sh

License

ionmadrazo/Vec2Read

Folders and files

Latest commit

History

Repository files navigation

Vec2Read: Multilingual Readability Assessment

Abstract

Requirements

Training your own model

Decide on a data loader

Configure training process

Train

License

About

Resources

License

Stars

Watchers

Forks

Languages