
Persian Irony Detection using transformer-based language models

  • input: a text in Persian
  • output: a classification of the text as ironic or non-ironic

Dataset

Existing datasets:

Steps to create a new dataset (crawl Persian tweets from a Telegram channel and label them automatically):

  • crawling: Crawl public channels' messages on Telegram through the Telegram API in crawling.py, and save the crawled messages as JSON files in ./crawled_messages (a crawling sketch follows the run commands below).

  • gathering: Concatenate the crawled files, keep the wanted attributes of each tweet in a Pandas DataFrame, and save it as a CSV file; gathering.py creates messages.csv.

  • cleaning: Apply basic cleaning to the previously created dataset and save the result to messages_cleaned.csv.

  • labeling: Assign a label to each tweet based on its top-2 most common reactions, then split the dataset into train and test sets saved in ../dataset/ (a labeling sketch follows the run commands below).

  • Run (the previous dataset will be replaced):

cd creating_dataset/
pip install -r requirements.txt
python crawling.py
python gathering.py
python cleaning.py
python labeling.py
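
The README does not pin a particular Telegram client, so purely as an illustration, here is a minimal sketch of the crawling step assuming the Telethon library; the API credentials, channel name, and message limit are placeholders, not values from this repository.

# crawling_sketch.py - a hypothetical stand-in for crawling.py, assuming Telethon
# and API credentials obtained from https://my.telegram.org.
import json
from pathlib import Path

from telethon.sync import TelegramClient

API_ID = 12345            # placeholder Telegram API id
API_HASH = "your_hash"    # placeholder Telegram API hash
CHANNEL = "some_channel"  # placeholder public channel to crawl

out_dir = Path("./crawled_messages")
out_dir.mkdir(exist_ok=True)

with TelegramClient("irony_session", API_ID, API_HASH) as client:
    for message in client.iter_messages(CHANNEL, limit=10000):
        if not message.text:
            continue  # skip media-only messages
        # to_dict() keeps the raw attributes (text, date, reactions, ...)
        record = message.to_dict()
        path = out_dir / f"{message.id}.json"
        path.write_text(json.dumps(record, default=str, ensure_ascii=False))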
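
The labeling rule itself ("top-2 common reactions") is not spelled out here, so the sketch below shows one plausible reading of it; the format of the reactions column and the set of irony-signalling emojis are assumptions for illustration only.

# labeling_sketch.py - a hypothetical reading of the top-2 reactions rule;
# the "reactions" column format and IRONIC_REACTIONS set are assumptions.
import ast

import pandas as pd
from sklearn.model_selection import train_test_split

IRONIC_REACTIONS = {"😂", "🤣"}  # assumed irony-signalling reactions

df = pd.read_csv("messages_cleaned.csv")

def label_row(reactions: dict) -> int:
    # Take the two most frequent reactions and mark the tweet ironic (1)
    # if either of them is in the irony-signalling set.
    top2 = sorted(reactions, key=reactions.get, reverse=True)[:2]
    return int(any(r in IRONIC_REACTIONS for r in top2))

# assumes a "reactions" column of dict-like strings, e.g. "{'😂': 12, '👍': 3}"
df["label"] = df["reactions"].apply(ast.literal_eval).apply(label_row)

train, test = train_test_split(df, test_size=0.2, stratify=df["label"],
                               random_state=42)
train.to_csv("../dataset/train.csv", index=False)
test.to_csv("../dataset/test.csv", index=False)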

Model

Fine-tuning a transformer-based language model on the Persian irony detection dataset. First, install the dependencies:

cd model/ 
pip install -r requirements.txt

Fine-tune a transformer-based language model on the irony detection dataset:

python train.py \
  --datapath  [path to dataset] \
  --modelpath [path to transformer-based language model] \
  --modelout  [path to save the finetuned model] \
  --savemodel [path to save the finetuned model] \
  --maxlen    [maximum sequence length] \
  --batch     [batch size] \
  --epoch     [epochs] \
  --lr        [learning rate]
# example
python train.py --datapath ../dataset/ --modelpath xlm-roberta-base --batch 16 --epoch 5
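
The internals of train.py are not shown in this README; for orientation, a minimal fine-tuning loop for the same task with the Hugging Face Trainer might look like the sketch below, where the file names, column names, and hyperparameters are illustrative assumptions.

# finetune_sketch.py - a minimal sketch of the fine-tuning step with Hugging Face
# transformers; file names, column names, and hyperparameters are assumptions.
import pandas as pd
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          DataCollatorWithPadding, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=2)  # ironic vs. non-ironic

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

# assumes labeling.py wrote train.csv with "text" and "label" columns
train_ds = Dataset.from_pandas(pd.read_csv("../dataset/train.csv"))
train_ds = train_ds.map(tokenize, batched=True)

args = TrainingArguments(output_dir="runs/model",
                         per_device_train_batch_size=16,
                         num_train_epochs=5,
                         learning_rate=2e-5)

Trainer(model=model, args=args, train_dataset=train_ds,
        data_collator=DataCollatorWithPadding(tokenizer)).train()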

Predict labels for the test set with a trained model:

python predict.py \
  --datapath  [path to dataset] \
  --modelpath [path to transformer-based language model] \
  --predspath [path for predictions of the test set] \
  --maxlen    [maximum sequence length] \
  --batch     [batch size] \
  --epoch     [epochs] \
  --lr        [learning rate]
# example
python predict.py --datapath ../dataset/ --modelpath xlm-roberta-base --predspath runs/preds
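
For a quick interactive sanity check of a fine-tuned checkpoint, the transformers pipeline API can also be used; the checkpoint path below is an assumption about where train.py saved its output.

# predict_sketch.py - quick check with a saved checkpoint; the "runs/model"
# path is an assumption, not a path defined by this repository.
from transformers import pipeline

clf = pipeline("text-classification", model="runs/model")
print(clf("یک جمله فارسی"))  # e.g. [{'label': 'LABEL_1', 'score': 0.93}]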

Results

Comparison of different fine-tuned language models on the Persian dataset:

Language Model      Accuracy  Recall  Precision  F1
ParsBERT v3         81.3%     81.4%   81.3%      81.3%
XLM-RoBERTa-Base    82.6%     82.8%   82.6%      82.5%
XLM-RoBERTa-Large   84.7%     84.7%   84.6%      84.6%

About

Persian irony detection: includes a Persian dataset, automatic dataset creation, and fine-tuning of transformer-based language models for the task.
