CEFRgrader

Code accompanying a REPROLANG 2020 submission.

This repository relates to the REPROLANG 2020 shared task at the LREC conference, specifically Task D.2, which replicates the following paper:

Sowmya Vajjala & Taraka Rama, 2018. Experiments with Universal CEFR classifications. In Proceedings of BEA. (V&R)

To refer to this new piece of work please cite:

Andrew Caines & Paula Buttery, 2020. REPROLANG 2020: Automatic Proficiency Scoring of Czech, English, German, Italian, and Spanish Learner Essays. In Proceedings of the 12th Language Resources and Evaluation Conference.

@InProceedings{caines-buttery:2020:LREC,
  author    = {Caines, Andrew  and  Buttery, Paula},
  year      = {2020},
  title     = {REPROLANG 2020: Automatic Proficiency Scoring of Czech, English, German, Italian, and Spanish Learner Essays},
  booktitle = {Proceedings of The 12th Language Resources and Evaluation Conference},
  url       = {https://www.aclweb.org/anthology/2020.lrec-1.689}
}

Link to full-text paper.

REPROLANG 2020 Task D.2

Andrew Caines, @cainesap, University of Cambridge, UK

V&R's original code needed little amendment. Please refer to their repository and README for background information. This repository describes my replication efforts only.

The main issue was a lack of clarity about the workflow. Here I list the steps to get from corpus download to end results. Any errors are my own.

Pre-processing

1. download the MERLIN corpus (all files) from CLARIN via https://merlin-platform.eu/C_data.php
2. unzip the downloaded file, then unzip merlin-text-v1.1.zip and move merlin-text-v1.1/meta_ltext to your preferred location (= $MERLIN)
3. make an output directory for processed files (= $OUTDIR)
4. download or clone this repository (i.e. git clone https://github.com/cainesap/reprolang_github.git)
5. change directory to the root of the repository (i.e. cd reprolang_github)
6. run python3 01_corpusCollation.py $MERLIN $OUTDIR
7. note that the exclusion of files described in the V&R paper is now handled in step 6 (rather than by post hoc removal with a separate script)
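The output-directory and collation steps above can be sketched in Python as follows. This is only an illustration: the helper name `collation_command` and the example paths are hypothetical, and the sketch assumes that 01_corpusCollation.py takes exactly the two positional arguments shown in the steps.

```python
import os
import shlex


def collation_command(merlin_dir, out_dir, script="01_corpusCollation.py"):
    """Create the output directory and return the argv list for the collation step.

    merlin_dir stands for $MERLIN and out_dir for $OUTDIR in the steps above.
    The helper itself is hypothetical and not part of the repository.
    """
    # Make the output directory for processed files if it does not exist yet.
    os.makedirs(out_dir, exist_ok=True)
    # Mirrors: python3 01_corpusCollation.py $MERLIN $OUTDIR
    return ["python3", script, merlin_dir, out_dir]


def as_shell_line(command):
    """Render an argv list as a single, safely quoted shell line."""
    return shlex.join(command)
```

From the root of the repository, the returned list could be passed to `subprocess.run(command, check=True)`, or printed with `as_shell_line` to inspect it before running.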

Feature extraction

1. install the udpipe R library (i.e. > install.packages('udpipe')) and download the version 2.0 models for Czech, German and Italian from LINDAT/CLARIN (not the latest models, so as to match the ones V&R used)
2. install LanguageTool from source (i.e. curl -L https://raw.githubusercontent.com/languagetool-org/languagetool/master/install.sh | bash); your path to 'LanguageTool-V.v-stable' is now $LANGTOOL
3. run Rscript 02_featureExtraction.R $UDPIPE $LANGTOOL $INDIR $FEATSFILE to extract features from the texts, define test folds through stratified sampling, and print out the tokenised text
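To illustrate what test-fold definition through stratified sampling means, here is a minimal Python sketch. It is not the procedure implemented in 02_featureExtraction.R: the function name `stratified_folds`, the number of folds `k` and the `seed` are all assumptions chosen for illustration.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=10, seed=1):
    """Assign each item to one of k folds so that every label is spread evenly.

    labels is a list of class labels (e.g. CEFR levels), one per text.
    Returns a list of fold indices (0..k-1) aligned with labels.
    k and seed are illustrative defaults, not values taken from the repository.
    """
    rng = random.Random(seed)
    # Group item positions by their label.
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)
    folds = [None] * len(labels)
    # Shuffle within each label, then deal the items round-robin into folds.
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[index] = position % k
    return folds
```

Dealing items round-robin within each label keeps the label proportions of every fold close to those of the full data set, which is the point of stratification.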

Experiments

1. run Rscript 03_classificationExperiments_parallel_[monoling|multiling|crossling].R to run the monolingual, multilingual and crosslingual experiments as in V&R
2. run Rscript 04_resultsSummary.R to print a summary of results based on the experiment logs
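The experiment steps above could be driven from Python along these lines. This is a sketch under assumptions: the helper `experiment_commands` is hypothetical, the scripts are located by pattern rather than by a hard-coded file name, and any arguments the R scripts may expect are not shown in the steps above and are therefore omitted.

```python
from pathlib import Path

# The three experimental settings named in the steps above.
SETTINGS = ("monoling", "multiling", "crossling")


def experiment_commands(repo_dir="."):
    """Return argv lists for the three experiment scripts plus the summary script.

    Scripts are found by globbing in repo_dir, so a file name with or without
    an extra infix before the setting name is accepted.
    """
    repo = Path(repo_dir)
    commands = []
    for setting in SETTINGS:
        matches = sorted(repo.glob(f"03_classificationExperiments_*{setting}.R"))
        if not matches:
            raise FileNotFoundError(f"no experiment script found for {setting}")
        commands.append(["Rscript", matches[0].name])
    # The summary is run last because it reads the experiment logs.
    commands.append(["Rscript", "04_resultsSummary.R"])
    return commands
```

Each returned list could then be run in order with `subprocess.run(command, cwd=repo_dir, check=True)`, so that a failing experiment stops the sequence before the summary step.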

Andrew Caines, March 2020
