CEFRgrader

Code accompanying a REPROLANG 2020 submission. This repository relates to the REPROLANG 2020 shared task at the LREC conference, specifically Task D.2, which replicates the following paper:

Sowmya Vajjala & Taraka Rama, 2018. Experiments with Universal CEFR classifications. In Proceedings of BEA. (V&R)

To refer to this new piece of work please cite:

Andrew Caines & Paula Buttery, 2020. REPROLANG 2020: Automatic Proficiency Scoring of Czech, English, German, Italian, and Spanish learner essays. In: Proceedings of the 12th Language Resources and Evaluation Conference.

@InProceedings{caines-buttery:2020:LREC,
  author    = {Caines, Andrew  and  Buttery, Paula},
  year      = {2020},
  title     = {REPROLANG 2020: Automatic Proficiency Scoring of Czech, English, German, Italian, and Spanish Learner Essays},
  booktitle = {Proceedings of The 12th Language Resources and Evaluation Conference},
  url       = {https://www.aclweb.org/anthology/2020.lrec-1.689}
}

Full-text paper: https://www.aclweb.org/anthology/2020.lrec-1.689


REPROLANG 2020 Task D.2

Andrew Caines, @cainesap, University of Cambridge, UK

V&R's original code needed little amendment. Please refer to that repository and readme for background information. This repository describes my replication efforts only.

The main issue was a lack of clarity about the workflow. Here I list the steps needed to get from corpus download to final results. Any errors are my own.

Pre-processing

  1. download the MERLIN corpus (all files) from CLARIN via https://merlin-platform.eu/C_data.php
  2. unzip the downloaded file, then unzip merlin-text-v1.1.zip and move merlin-text-v1.1/meta_ltext to your preferred location (= $MERLIN)
  3. make an output directory for processed files (= $OUTDIR)
  4. download or clone this repository (i.e. git clone https://github.com/cainesap/reprolang_github.git)
  5. change directory to root of the repository (i.e. cd reprolang_github)
  6. run python3 01_corpusCollation.py $MERLIN $OUTDIR
  7. note that the exclusion of files described in the V&R paper is now handled in step 6 (rather than by post hoc removal with a separate script)
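
Steps 3 and 6 can be wrapped in a small helper. The following is a minimal sketch, not part of this repository: the function name collation_command is hypothetical, and it assumes 01_corpusCollation.py takes exactly the two positional arguments shown in step 6.

```python
from pathlib import Path


def collation_command(merlin_dir, out_dir, script="01_corpusCollation.py"):
    """Build the step 6 command after checking the inputs from steps 2 and 3.

    Hypothetical helper: it only assumes the two positional arguments
    ($MERLIN and $OUTDIR) shown in the steps above.
    """
    merlin = Path(merlin_dir)
    out = Path(out_dir)
    # Step 2: the meta_ltext directory must already be in place.
    if not merlin.is_dir():
        raise FileNotFoundError(f"MERLIN directory not found: {merlin}")
    # Step 3: create the output directory if it does not exist yet.
    out.mkdir(parents=True, exist_ok=True)
    return ["python3", script, str(merlin), str(out)]


# Example use from the repository root (not executed here):
#   import os, subprocess
#   cmd = collation_command(os.environ["MERLIN"], os.environ["OUTDIR"])
#   subprocess.run(cmd, check=True)
```

Returning the command as a list, rather than running it directly, makes it easy to inspect before passing it to subprocess.run.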

Feature extraction

  1. install the udpipe R library (i.e. run install.packages('udpipe') at the R prompt) and download the version 2.0 models for Czech, German and Italian from LINDAT/CLARIN (not the latest models, so as to match the ones V&R used)
  2. install LanguageTool from source (i.e. curl -L https://raw.githubusercontent.com/languagetool-org/languagetool/master/install.sh | bash); your path to 'LanguageTool-V.v-stable' is now $LANGTOOL
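  3. run Rscript 02_featureExtraction.R $UDPIPE $LANGTOOL $INDIR $FEATSFILE to extract features from the texts, define test folds through stratified sampling, and print out the tokenised text ($UDPIPE is presumably the location of the udpipe models from step 1, $INDIR the directory of processed files from pre-processing, and $FEATSFILE the output file for the features)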
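
The test-fold definition in step 3 is done by stratified sampling inside 02_featureExtraction.R. As an illustration of the general idea only (this is a Python sketch, not the R code in this repository; the function name, the number of folds and the seed are all assumptions), the texts for each CEFR label can be shuffled and dealt round-robin into folds, so that every fold receives a similar label distribution:

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds=10, seed=0):
    """Return one fold index (0 .. n_folds-1) per item, balanced by label.

    Illustrative only: n_folds and seed are assumed values, not taken
    from the repository's R script.
    """
    rng = random.Random(seed)
    # Group item positions by their label (e.g. "A2", "B1").
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    folds = [None] * len(labels)
    offset = 0  # rotate the starting fold so overall fold sizes stay even
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        # Deal this label's items round-robin across the folds.
        for position, idx in enumerate(indices):
            folds[idx] = (offset + position) % n_folds
        offset = (offset + len(indices)) % n_folds
    return folds
```

With this scheme the number of texts per label differs by at most one between any two folds, and fixing the seed makes the assignment reproducible.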

Experiments

  1. run Rscript 03_classificationExperiments_[monoling|multiling|crossling].R (one script per setting) to run the monolingual, multilingual and crosslingual experiments as in V&R
  2. run Rscript 04_resultsSummary.R to print a summary of results based on experiment logs
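
The bracketed script name in step 1 stands for three separate scripts. Below is a minimal sketch of expanding it into concrete commands, followed by the summary script from step 2. It assumes (as the steps above suggest, but not verified here) that these scripts take no command-line arguments; experiment_commands is a hypothetical helper name.

```python
# The three settings named in step 1 of the experiments.
EXPERIMENT_SETTINGS = ("monoling", "multiling", "crossling")


def experiment_commands(settings=EXPERIMENT_SETTINGS):
    """List the Rscript commands for the experiments, then the summary.

    Assumption: each script is run without arguments, as written above.
    """
    commands = [
        ["Rscript", f"03_classificationExperiments_{setting}.R"]
        for setting in settings
    ]
    # The summary reads the experiment logs, so it runs last.
    commands.append(["Rscript", "04_resultsSummary.R"])
    return commands


# Example use from the repository root (not executed here):
#   import subprocess
#   for cmd in experiment_commands():
#       subprocess.run(cmd, check=True)
```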

Andrew Caines, March 2020
