Code for paper 'Combining n-grams and deep convolutional features for language variety classification'

Installation, documentation

Python 3 is needed.
To get the code and all the data, clone the project from the repository with 'git clone http://source.ijs.si/mmartinc/NLE_2017'
To just get the code (and some data, but not all, which means that the results from the paper will not be reproducible), clone the project from the EMBEDDIA repository with 'git clone https://github.com/EMBEDDIA/NLE_2017'.
Install dependencies if needed: pip3 install -r requirements.txt

To reproduce the results published in the paper run the code in the command line using following commands:

python3 train.py --experiment DSLCC;
python3 train.py --experiment GDIC;
python3 train.py --experiment ADIC;

The DSLCC v4.0 and ADIC data can be found in the data folder. The GDIC corpus is not available but you can ask the Vardial 2018 GDI task administrator for a copy and download it in the /data/gdic folder.

If you use the DSLCC v4.0 dataset, please refer to the following corpus description paper: Liling Tan, Marcos Zampieri, Nikola Ljubešić, Jörg Tiedemann (2014): Merging Comparable Data Sources for the Discrimination of Similar Languages: The DSL Corpus Collection. Proceedings of the 7th Workshop on Building and Using Comparable Corpora (BUCC). pp. 6-10. Reykjavik, Iceland.

If you use the ADIC dataset, please refer to the following corpus description paper: Ahmed Ali, Najim Dehak, Patrick Cardinal, Sameer Khurana, Sree Harsha Yella, James Glass, Peter Bell, Steve Renals (2015): Automatic dialect detection in arabic broadcast speech. In Proceedings of Interspeech.

You can also use the system on your own custom datasets. Following arguments are available:

--experiment: Default is DSLCC. Other allowed values are: GDIC, ADIC and OTH (for custom datasets).
--data_directory : Path to data directory
--train_corpus : Path to train corpus - first column should be text, second a label. Columns should be separated by tab.
--dev_corpus : Path to development corpus - first column should be text, second a label. Columns should be separated by tab..
--test_corpus: Path to test corpus - first column should be text, second a label. Columns should be separated by tab. --weighting: Define the weighting scheme, it can either be "tfidf" or "bm25". Default is "tfidf"

For further costumization, just tweak the code or send me an email (matej.martinc@ijs.si) :).

Contributors to the code

Matej Martinc
Senja Pollak

Knowledge Technologies Department, Jožef Stefan Institute, Ljubljana

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
data		data
.gitignore		.gitignore
LICENSE.md		LICENSE.md
README.md		README.md
bm25.py		bm25.py
dl_architecture.py		dl_architecture.py
language_variety.py		language_variety.py
requirements.txt		requirements.txt
train.py		train.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

data

data

.gitignore

.gitignore

LICENSE.md

LICENSE.md

README.md

README.md

bm25.py

bm25.py

dl_architecture.py

dl_architecture.py

language_variety.py

language_variety.py

requirements.txt

requirements.txt

train.py

train.py

Repository files navigation

Code for paper 'Combining n-grams and deep convolutional features for language variety classification'

Installation, documentation

Contributors to the code

About

Releases

Packages

Languages

License

EMBEDDIA/NLE_2017

Folders and files

Latest commit

History

Repository files navigation

Code for paper 'Combining n-grams and deep convolutional features for language variety classification'

Installation, documentation

Contributors to the code

About

Resources

License

Stars

Watchers

Forks

Languages