
Exploring the Impact of Lexical and Grammatical Features on Automatic Genre Identification

Analysis of feature importance for genre identification through data transformation

Task Description

In this task, I analyse the importance of different linguistic features for automatic (web) genre identification (AGI) by comparing the performance of machine learning models trained on various text representations. This approach shows to what extent lexical, grammatical and other features contribute to the identification of genre.

I perform text classification with the linear fastText model. For the experiments, I use the Slovene Web genre identification corpus GINCO 1.0, which consists of 1,002 texts manually annotated with 24 genre labels.

I train and test the fastText model on:

  • baseline: plain text as extracted from the web during the creation of web corpora (used in previous experiments)
  • pre-processed: lower-cased, punctuation removed, numbers removed
  • reduced to lemmas
  • transformed into part-of-speech tags: part-of-speech tags (upos) and morphosyntactic descriptors (MSD)
  • transformed into syntactic dependencies
  • reduced representations, consisting only of the words belonging to a certain word class (i.e. only nouns, only verbs, only adjectives, etc.) - a sketch of extracting such representations from CoNLL-U output is shown below
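As a rough illustration of how these representations can be obtained from a linguistically processed text, below is a minimal sketch using the conllu library; the file name and the choice of fields are illustrative, not the exact code from the notebooks.

    from conllu import parse

    # Read a CoNLL-U file produced by the linguistic processing step
    with open("ginco_sample.conllu", encoding="utf-8") as f:
        sentences = parse(f.read())

    # Collect parallel representations of the same document
    representations = {"text": [], "lemmas": [], "upos": [], "msd": [], "deprel": []}
    for sentence in sentences:
        for token in sentence:
            representations["text"].append(token["form"])
            representations["lemmas"].append(token["lemma"])
            representations["upos"].append(token["upos"])
            representations["msd"].append(token["xpos"] or "_")  # MSD tags are stored in the XPOS field
            representations["deprel"].append(token["deprel"])

    # Each representation is joined into one whitespace-separated string per document,
    # which is the input format expected by fastText
    dependency_text = " ".join(representations["deprel"])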

The setups are compared based on micro and macro F1 scores, which measure the models' performance at the instance level and the label level respectively, and on confusion matrices.

The fastText model was chosen because it achieves the best results on this task when compared with other common classifiers. The comparison is based on the baseline text; the other classifiers used a TF-IDF representation. The classifiers are ordered by their macro F1 scores.

| Model | Micro F1 | Macro F1 |
|---|---|---|
| Dummy Classifier - Most Frequent | 0.241 | 0.078 |
| Dummy Classifier - Stratified | 0.27 | 0.221 |
| Support Vector Machine (SVC) | 0.489 | 0.333 |
| Decision Tree | 0.34 | 0.35 |
| Multinomial Naive Bayes classifier | 0.518 | 0.342 |
| Logistic Regression | 0.518 | 0.383 |
| Random Forest classifier | 0.511 | 0.408 |
| Complement Naive Bayes classifier | 0.539 | 0.416 |
| FastText | 0.56 | 0.589 |

Steps

Data Preparation, Experiment Setup

See the notebook 1-Preparing_Data_Hyperparameter_Search.ipynb, where I searched for the best hyperparameters for the task, and 2-Language-Processing-of-GINCO.ipynb, where I linguistically processed the data with the CLASSLA pipeline (a minimal sketch of this step is shown below).
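Below is a minimal sketch of that processing step, assuming the classla package; the example sentence is illustrative and does not come from the corpus.

    import classla

    # Download the Slovene models (needed only once)
    classla.download("sl")

    # Build a pipeline with tokenisation, PoS tagging, lemmatisation and dependency parsing
    nlp = classla.Pipeline("sl", processors="tokenize,pos,lemma,depparse")

    doc = nlp("To je kratek primer besedila.")
    # CoNLL-U output, from which the different text representations are extracted
    print(doc.to_conll())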

Data:

  • GINCO corpus with "keep" texts (this provides more text than using only the deduplicated paragraphs, while paragraphs manually marked as duplicates are omitted because they can be unrepresentative of the genre)
  • a smaller label set: starting from the downsampled set of 12 labels, labels with too few instances are discarded, as are the fuzzy labels (Other, List of Summaries/Excerpts) and texts marked as Hard --> 5 labels, 688 texts
  • original stratified train-dev-test split (60:20:20): 410:141:137 texts (a sketch of such a split is shown below)
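A stratified split like the one above could, for example, be produced with scikit-learn; this is only a sketch, assuming a pandas DataFrame df with a "label" column (the column name is illustrative).

    from sklearn.model_selection import train_test_split

    # 60% train, then the remaining 40% is split in half into dev and test,
    # stratifying by the genre label each time
    train_df, rest_df = train_test_split(df, test_size=0.4, stratify=df["label"], random_state=42)
    dev_df, test_df = train_test_split(rest_df, test_size=0.5, stratify=rest_df["label"], random_state=42)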

Preliminary experiments:

  • Optimising FastText - hyperparameter search on dev split --> average micro and macro F1 scores of 0.625 +/- 0.0036 and 0.618 +/- 0.003

Experiment Setup Conclusions

  • Experiments on the number of epochs --> 350 epochs used
  • Experiments on the learning rate --> lr = 0.7
  • Experiments on the number of word n-grams --> surprisingly, using unigrams (the default) gives the best results
  • Default context window (5)
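A training call with these hyperparameters could look like the sketch below, assuming the fasttext package and train/test files in fastText format (one document per line, prefixed with __label__<genre>); the file names are illustrative.

    import fasttext

    model = fasttext.train_supervised(
        input="ginco_train.txt",
        epoch=350,       # number of epochs
        lr=0.7,          # learning rate
        wordNgrams=1,    # unigrams gave the best results
        ws=5,            # default context window
    )

    # test() returns (number of samples, precision at 1, recall at 1)
    n, precision, recall = model.test("ginco_test.txt")
    print(n, precision, recall)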

Experiments on Text Representations

The figure with main results:

Main conclusions:

  • the best text representation is syntactic dependencies
  • some genre labels favour lexical representations, while others, such as Forum, are better classified with grammatical representations

For more figures regarding the results, see the folder results. The script for analyzing results is 6-Result_Analysis.ipynb.
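The micro and macro F1 scores and the confusion matrices used in this analysis can be computed with scikit-learn; the sketch below uses made-up labels purely for illustration.

    from sklearn.metrics import f1_score, confusion_matrix

    # Illustrative gold labels and model predictions
    y_true = ["News", "Forum", "Promotion", "News"]
    y_pred = ["News", "Forum", "News", "News"]

    print("Micro F1:", f1_score(y_true, y_pred, average="micro"))
    print("Macro F1:", f1_score(y_true, y_pred, average="macro"))
    print(confusion_matrix(y_true, y_pred, labels=sorted(set(y_true))))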

Results (details):

  • baseline text: micro F1: 0.56 +/- 0.0, macro F1: 0.589 +/- 0.0
  • lower-cased: micro F1: 0.553 +/- 0.0045, macro F1: 0.587 +/- 0.009 - slightly lower results
  • punctuation removed: micro F1: 0.58 +/- 0.0028, macro F1: 0.616 +/- 0.0024 - improved results, especially for Forum (see the graph)
  • numbers removed: micro F1: 0.583 +/- 0.0028, macro F1: 0.595 +/- 0.0025 - slight improvement, except for Forum, where the results are worse
  • lower-cased, punctuation removed, numbers removed: micro F1: 0.56 +/- 0.0, macro F1: 0.598 +/- 0.0 - no improvement at the micro level and a very slight improvement at the macro level; improvement for Forum, otherwise mostly none
  • lower-cased, punctuation removed, numbers removed, stopwords removed: micro F1: 0.596 +/- 0.0, macro F1: 0.597 +/- 0.00029 - improvement, more in micro than macro
  • lemmas: micro F1: 0.597 +/- 0.0053, macro F1: 0.601 +/- 0.0035 - significant improvement over the baseline, especially for Information/Explanation, Promotion, for News no change, for Forum and Opinion worse
  • part-of-speech tags (upos): micro F1: 0.54 +/- 0.0053, macro F1: 0.547 +/- 0.0056 - decrease overall, but in News and Opinion increase
  • morphosyntactic descriptors (MSD): micro F1: 0.563 +/- 0.0072, macro F1: 0.536 +/- 0.019, increase in micro, decrease in macro, improvement in News and Information, high variation in Forum
  • syntactic dependencies: micro F1: 0.61 +/- 0.0, macro F1: 0.639 +/- 0.00044 - the best results, high improvement, especially in News, Forum, Opinion. Decrease in Promotion.

Additional experiments

Reduced features (lemmas of the words in the selected PoS classes are kept, all other words are replaced with O - see the sketch after this list):

  • only open class words - stopwords removed (ADP, AUX, CCONJ and SCONJ, DET, NUM, PART and PRON): micro F1: 0.563 +/- 0.0072, macro F1: 0.535 +/- 0.015 - decrease in macro, slight increase in micro - stopwords do not have a big impact - huge decrease in Forum
  • only stop words: micro F1: 0.526 +/- 0.0053, macro F1: 0.559 +/- 0.0067 - decrease, especially in micro - same result for Forum, decrease in News and Information
  • only classes which denote subjectivity - ADJ, ADV, PART: micro F1: 0.468 +/- 0.009, macro F1: 0.408 +/- 0.019 - much lower results, huge decrease for Forum, for Opinion actually slightly better results
  • only PROPN, NOUN and VERB: micro F1: 0.496 +/- 0.0078, macro F1: 0.439 +/- 0.015 - decrease, for Information/Explanation increase, for others decrease, especially big for Forum and Opinion
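This transformation can be sketched as follows, reusing the sentences parsed with the conllu library in the earlier sketch; the function name and the chosen classes are illustrative.

    # Keep the lemmas of tokens whose PoS class is in keep_classes,
    # replace every other token with the placeholder "O"
    def reduce_to_classes(sentences, keep_classes):
        tokens = []
        for sentence in sentences:
            for token in sentence:
                if token["upos"] in keep_classes:
                    tokens.append(token["lemma"])
                else:
                    tokens.append("O")
        return " ".join(tokens)

    # e.g. keep only the classes that denote subjectivity
    reduced_text = reduce_to_classes(sentences, {"ADJ", "ADV", "PART"})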

Alternative representations without context (window = 1) - only a very slight difference:

  • baseline text: micro F1: 0.559 +/- 0.0028, macro F1: 0.588 +/- 0.002 - slightly different, almost the same
  • lemmas: micro F1: 0.597 +/- 0.0053, macro F1: 0.602 +/- 0.0039
  • part-of-speech tags (upos): micro F1: 0.546 +/- 0.0078, macro F1: 0.555 +/- 0.012
  • morphosyntactic descriptors (MSD): micro F1: 0.566 +/- 0.0069, macro F1: 0.539 +/- 0.014
  • syntactic dependencies: micro F1: 0.609 +/- 0.0028, macro F1: 0.637 +/- 0.0026

Additional experiment - on all 12 labels (primary_level_3), all 1002 texts:

  • baseline: micro F1: 0.425 +/- 0.0043, macro F1: 0.273 +/- 0.005
  • dependencies: micro F1: 0.48 +/- 0.0018, macro F1: 0.337 +/- 0.018 - improved results

Using a Transformer model

To compare fastText's performance with that of Transformer models, I trained and tested the base-sized XLM-RoBERTa model on the baseline text.

During the hyperparameter search, I searched for the optimal number of epochs, which turned out to be 13. I used the following hyperparameters:

        # simpletransformers ClassificationModel arguments
        args = {
            "overwrite_output_dir": True,
            "num_train_epochs": 13,
            "train_batch_size": 8,
            "learning_rate": 1e-5,
            "labels_list": LABELS,
            "max_seq_length": 512,
            "save_steps": -1,
            # Only the final trained model will be saved - to prevent filling all of the space
            "save_model_every_epoch": False,
            "wandb_project": "GINCO-hyperparameter-search",
            "silent": True,
        }
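With these arguments, training could look roughly like the sketch below, assuming the simpletransformers library and pandas DataFrames train_df and test_df with "text" and "labels" columns (the DataFrame names are illustrative).

    from simpletransformers.classification import ClassificationModel

    # Base-sized XLM-RoBERTa with the arguments defined above
    model = ClassificationModel("xlmroberta", "xlm-roberta-base", args=args, use_cuda=True)

    # Train on the train split and evaluate on the test split
    model.train_model(train_df)
    result, model_outputs, wrong_predictions = model.eval_model(test_df)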

The trained model was saved as a Weights & Biases (wandb) artifact and can be downloaded and loaded as follows:

import wandb
from simpletransformers.classification import ClassificationModel

run = wandb.init()
# Download the saved model artifact from Weights & Biases
artifact = run.use_artifact('tajak/GINCO-hyperparameter-search/GINCO-5-labels-classifier:v0', type='model')
artifact_dir = artifact.download()

# Loading the downloaded model from the local artifacts directory
model = ClassificationModel(
    "xlmroberta", "artifacts/GINCO-5-labels-classifier:v0")

Results on the dev split: macro F1: 0.82, micro F1: 0.818
Results on the test split: macro F1: 0.813, micro F1: 0.816
