ionmadrazo/VikiWiki

Cross-lingual Readability Assessment Dataset in 6 Languages (VikiWiki)

This repository contains the dataset for the scientific article: "Is Cross-lingual Readability Assessment Possible?". This is the result of a research collaboration between Ion Madrazo Azpiazu and Maria Soledad Pera.

Please cite this work as follows:

    @article{madrazo:2019,
        author  = {Ion Madrazo Azpiazu and Maria Soledad Pera},
        year    = {2019},
        title   = {Is Cross-lingual Readability Assessment Possible?},
        journal = {In press},
        volume  = {1},
        number  = {1},
        pages   = {1--18}
    }

Abstract

Most research efforts related to automatic readability assessment focus on the design of strategies that apply to a specific language. These state-of-the-art strategies are highly dependent on linguistic features that best suit the language for which they were intended, constraining their adaptability and making it difficult to determine whether they would remain effective if they were applied to estimate the level of difficulty of texts in other languages. In this paper, we present the results of a study designed to determine the feasibility of a cross-lingual readability assessment strategy. To do so, we first analyzed the most common features used for readability assessment and determined their influence on the readability prediction process for six different languages: English, Spanish, Basque, Italian, French, and Catalan. In addition, we developed a cross-lingual readability assessment strategy that serves as a means to empirically explore potential advantages of employing a single strategy (and set of features) for readability assessment in different languages, including inter-language prediction agreement and prediction accuracy improvement for low-resource languages.

Dataset contents

The dataset contains corpora at two difficulty levels (Vikidia articles are considered simpler than Wikipedia articles) in 6 languages (English, Spanish, French, Italian, Catalan, and Basque). Each combination of level and language contains 448 articles.
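As a quick sanity check after downloading, the counts above can be verified with a short script. The following is only a minimal sketch: this README does not describe the directory layout, so the `<root>/<level>/<language>/*.txt` structure, the folder names in `LEVELS`, the language codes in `LANGUAGES`, and the `.txt` extension are all assumptions that need to be adjusted to the actual files. Only the figures of 2 levels, 6 languages, and 448 articles per combination come from the description above.

```python
from pathlib import Path

# ASSUMPTION: folder names for the two difficulty levels (not confirmed by the README).
LEVELS = ("vikidia", "wikipedia")
# ASSUMPTION: ISO 639-1 codes used as folder names for the six languages.
LANGUAGES = ("en", "es", "fr", "it", "ca", "eu")
# From the README: 448 articles per level and language.
ARTICLES_PER_PAIR = 448


def expected_total():
    """Total article count implied by the README (levels x languages x 448)."""
    return len(LEVELS) * len(LANGUAGES) * ARTICLES_PER_PAIR


def count_articles(root):
    """Count *.txt files per (level, language) under the assumed layout.

    Missing folders are reported with a count of 0 instead of raising.
    """
    counts = {}
    for level in LEVELS:
        for lang in LANGUAGES:
            folder = Path(root) / level / lang
            if folder.is_dir():
                counts[(level, lang)] = sum(
                    1 for p in folder.glob("*.txt") if p.is_file()
                )
            else:
                counts[(level, lang)] = 0
    return counts


def mismatches(root):
    """Return the (level, language) pairs whose count differs from 448."""
    return {
        pair: n
        for pair, n in count_articles(root).items()
        if n != ARTICLES_PER_PAIR
    }
```

Calling `mismatches("path/to/VikiWiki")` (a hypothetical path) returns an empty dictionary when every combination holds exactly 448 files under the assumed layout. Under the reading that each of the 12 level and language combinations holds 448 articles, `expected_total()` gives 5,376 articles overall.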

License

This dataset is provided under the Creative Commons Attribution-ShareAlike 3.0 United States License. See the LICENSE file for more details.
