Linguistically Motivated Features for Short-Text Genre Classification

Abstract

This work deploys linguistically motivated features to classify paragraph-level text into the fiction and non-fiction genres using a logistic regression model, and infers the lexical and syntactic properties that help distinguish the two genres. Previous work has focused on classifying document-level text into fiction and non-fiction genres, while in this work we deal with shorter texts, which are closer to real-world applications like sentiment analysis of tweets. For the task of short-text classification on the Brown corpus, a model containing linguistically motivated features confers a substantial accuracy jump over a baseline model consisting of simple POS-ratio features found effective in previous work. The efficacy of the model containing the linguistically motivated feature set also transfers to another dataset, viz., the Baby BNC corpus. Subsequently, we compared the classification accuracy of the logistic regression model with two deep-learning models. A 1D-CNN model gives a 2% accuracy increase over the logistic regression classifier on both datasets. A BERT-based model gives state-of-the-art classification accuracies of 97% on the Brown corpus and 98% on the Baby BNC corpus. Although both deep-learning models give better results in terms of classification accuracy, the problem of interpreting these models remains an open question. In contrast, the regression model's coefficients revealed that fiction texts tend to have more character-level diversity and lower lexical density (quantified using content-function word ratios) compared to non-fiction texts. Moreover, subtle differences in word order exist between the two genres, i.e., in fiction texts verbs precede adverbs, in contrast to the opposite pattern in non-fiction texts (inter alia).

Features

The repository contains code to calculate four categories of features, each with several sub-types, as outlined in the paper (a toy sketch follows the list):

  1. Raw Features
  2. Lexical Features
  3. POS Features
  4. Syntactic Features
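
For illustration only, the sketch below computes simple stand-ins for the first three categories (character-level diversity, a content-function word ratio, and POS-tag ratios, all mentioned in the abstract). The function and tag set here are hypothetical and are not the repository's implementation in src/extract_all_features.py; the syntactic features would additionally require a constituency parse.

```python
# Illustrative stand-ins for the feature categories; NOT the repository's
# actual implementation (see src/extract_all_features.py for that).
from collections import Counter

import nltk  # assumes the 'punkt' and 'averaged_perceptron_tagger' data are downloaded


def toy_features(paragraph):
    tokens = nltk.word_tokenize(paragraph)
    tags = [tag for _, tag in nltk.pos_tag(tokens)]

    # Raw: character-level diversity (distinct non-space characters / total)
    chars = [c for c in paragraph if not c.isspace()]
    char_diversity = len(set(chars)) / max(len(chars), 1)

    # Lexical: rough content-to-function word ratio using a simplified content tag set
    content_tags = {"NN", "NNS", "NNP", "VB", "VBD", "VBZ", "JJ", "RB"}
    content = sum(1 for t in tags if t in content_tags)
    lexical_density = content / max(len(tags) - content, 1)

    # POS: per-tag ratios over the paragraph
    pos_ratios = {t: c / len(tags) for t, c in Counter(tags).items()}

    # Syntactic: the paper's features rely on constituency parses (Stanford CoreNLP),
    # e.g. tree depth; omitted here to keep the sketch self-contained.
    return {"char_diversity": char_diversity,
            "lexical_density": lexical_density,
            "pos_ratios": pos_ratios}
```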

Usage

  1. Install the required libraries listed in the requirements.txt file.
  2. Note: the 'stanfordcorenlp' client requires Java 8+ for constituency parsing.
  3. Run the inference.py file to calculate features and run the trained models for predictions (a minimal sketch follows this list).
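
A minimal sketch of the inference flow, assuming a pickled scikit-learn classifier and a feature-extraction helper; the paths, file names, and the imported function name below are placeholders, and inference.py remains the authoritative example:

```python
# Sketch only: names and paths are placeholders, not the repository's exact API.
import pickle

from stanfordcorenlp import StanfordCoreNLP
from src.extract_all_features import extract_all_features  # hypothetical import

nlp = StanfordCoreNLP("/path/to/stanford-corenlp")   # constituency parser; needs Java 8+

with open("models/logistic_regression.pkl", "rb") as f:  # placeholder model file
    clf = pickle.load(f)

text = "He walked slowly along the riverbank, lost in thought."
features = extract_all_features(text, nlp)           # raw, lexical, POS, syntactic features
print(clf.predict([features]))                       # e.g., ['fiction']

nlp.close()
```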

Folder Structure

  • src

    • extract_all_features.py: Calculates all four categories of features.
  • brown_corpus

    • brown_corpus_logistic_regression_analysis.ipynb: Notebook with classification accuracy using different feature sets on Brown corpus paragraphs (refer to Table 2).
    • brown_corpus_CNN_Glove_analysis.ipynb: Notebook with classification accuracy of the 1D CNN model on Brown Corpus (refer to Table 4).
    • brown_corpus_bert_fine_tune.py: Code for fine-tuning BERT on Brown corpus data (refer to Table 4).
  • baby_bnc

    • baby_bnc_corpus_logistic_regression.ipynb: Notebook with classification accuracy of the Logistic Regression model (trained on the Brown corpus) on the Baby BNC corpus (refer to Table 3).
    • baby_bnc_corpus_CNN_Glove_analysis.ipynb: Notebook with classification accuracy of the 1D CNN model on Baby BNC Corpus (refer to Table 3).
  • inference.py: Example file demonstrating how to calculate features and use the trained model to predict genres for a given text.

Citation

@inproceedings{kazmi-etal-2022-linguistically,
    title = "Linguistically Motivated Features for Classifying Shorter Text into Fiction and Non-Fiction Genre",
    author = "Kazmi, Arman  and
      Ranjan, Sidharth  and
      Sharma, Arpit  and
      Rajkumar, Rajakrishnan",
    booktitle = "Proceedings of the 29th International Conference on Computational Linguistics",
    month = oct,
    year = "2022",
    address = "Gyeongju, Republic of Korea",
    publisher = "International Committee on Computational Linguistics",
    url = "https://aclanthology.org/2022.coling-1.77",
    pages = "922--937",
}