Spanish Controversy Detection Language Model

This repository contains the code for the paper "Anticipating the Debate: Predicting Controversy in News with Transformer-based NLP".

Controversy is a social phenomenon that emerges when a topic generates strong disagreement among people. In the public sphere, controversy is very often related to news. Whereas previous approaches have addressed controversy detection, in this work we propose to predict controversy from the title and content of a news post. First, we collect and prepare a dataset from a Spanish news aggregator that labels each news item's controversy in a community-based manner. Next, we test how well language models can learn these labels by fine-tuning models on both title and content, and on the title alone. To cope with class imbalance, we run several experiments with different samplings of the dataset. The best model obtains 84.72% micro-F1; it was trained on the unbalanced dataset with the title and content as input. These preliminary results show that the task can be learned by relying on linguistic and social features.

Model 🤖

We use a dataset of news from the Menéame platform, tagged with controversy labels in a community-based manner. The best model was trained with a batch size of 4 and a learning rate of 1e-5 for 5 epochs. We selected the best checkpoint according to the downstream task metric on the development set and then evaluated it on the test set.
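
The checkpoint selection described above can be sketched in plain Python. This is a minimal illustration, not the training code in src/model_training: the names `TrainingConfig` and `select_best_checkpoint` are hypothetical, and the development scores in the usage lines are made-up placeholders, not results from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingConfig:
    """Hyperparameters reported for the best model (class name is hypothetical)."""
    batch_size: int = 4
    learning_rate: float = 1e-5
    epochs: int = 5


def select_best_checkpoint(dev_scores):
    """Return the checkpoint name with the highest development-set score.

    dev_scores maps a checkpoint name to its metric on the development set.
    Ties are resolved in favour of the checkpoint that appears first.
    """
    if not dev_scores:
        raise ValueError("dev_scores must contain at least one checkpoint")
    return max(dev_scores, key=dev_scores.get)


config = TrainingConfig()

# Placeholder scores for illustration only; they are NOT results from the paper.
example_dev_scores = {"epoch-1": 0.70, "epoch-2": 0.75, "epoch-3": 0.74}
best = select_best_checkpoint(example_dev_scores)
print(best)
```

Only the selected checkpoint is then scored on the test set, so the test data plays no role in model selection.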

Hugging Face: https://huggingface.co/PlanTL-GOB-ES/Controversy-Prediction

Dataset 🗂️

Collection	Set	Instances
All	Controversial	5,584
All	Non-controversial	231,385
Unbalanced Subset from All	Train	18,270
Unbalanced Subset from All	Development	1,058
Unbalanced Subset from All	Test	1,058
Balanced Subset from All	Train	9,900
Balanced Subset from All	Development	634
Balanced Subset from All	Test	1,058
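
To make the class imbalance concrete, the short calculation below uses only the counts from the table above. It is a sketch for orientation; the variable names are chosen here for illustration and do not come from the repository.

```python
# Counts taken from the table above.
controversial = 5_584
non_controversial = 231_385

total = controversial + non_controversial
controversial_share = controversial / total
imbalance_ratio = non_controversial / controversial

# Split sizes of the two subsets, also from the table.
unbalanced = {"train": 18_270, "development": 1_058, "test": 1_058}
balanced = {"train": 9_900, "development": 634, "test": 1_058}

print(f"Total instances: {total}")
print(f"Controversial share: {controversial_share:.2%}")
print(f"Non-controversial items per controversial item: {imbalance_ratio:.1f}")
print(f"Unbalanced subset size: {sum(unbalanced.values())}")
print(f"Balanced subset size: {sum(balanced.values())}")
```

In the full collection, controversial items make up about 2.4% of the data, roughly one for every 41 non-controversial items.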

Evaluation ✅

Dataset	Training Setting	F1	Accuracy	Time (s)
Balanced	Title	0.7026	0.6295	1653
Balanced	Title + Summary	0.8093	0.7353	1267
Unbalanced	Title	0.8197	0.7268	2631
Unbalanced	Title + Summary	0.8472	0.7662	2615

Usage example of the model ⚗️

from pprint import pprint

from transformers import pipeline

# Load the fine-tuned controversy classifier from the Hugging Face Hub.
nlp = pipeline("text-classification", model="PlanTL-GOB-ES/Controversy-Prediction")

# Example input: a Spanish news title followed by its summary.
example = "Esposas, hijos, nueras y familiares de altos cargos del PP y de la cúpula universitaria llenan la URJC -- Pedro González-Trevijano, rector de la universidad desde 2002 a 2013, ahora magistrado del Tribunal Constitucional, y su sucesor en el cargo, Fernando Suárez han tejido una red que ha dado cobijo laboral a más de un centenar de familiares de vicerrectores, gerentes o catedráticos en los cuatro campus con los que cuenta la universidad localizados en Alcorcón, Móstoles, Fuenlabrada y Vicálvaro."

# The pipeline returns a list of dicts, each holding a predicted label and its score.
output = nlp(example)
pprint(output)
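
The helpers below sketch one way to prepare inputs and read predictions. They are assumptions for illustration: `build_input` and `top_prediction` are hypothetical names, the " -- " separator is inferred from the example string above rather than from a documented input format, and the label in the sample output is a placeholder, since the actual label names depend on the model configuration.

```python
def build_input(title, summary=None):
    """Join title and summary with " -- ", mirroring the example string above.

    Assumption: the separator is inferred from the example, not from a
    documented input format. With no summary, the title is returned alone.
    """
    title = title.strip()
    if summary is None or not summary.strip():
        return title
    return f"{title} -- {summary.strip()}"


def top_prediction(pipeline_output):
    """Return (label, score) of the highest-scoring entry.

    Expects a list of dicts with "label" and "score" keys, the shape that a
    text-classification pipeline call returns for a single input.
    """
    best = max(pipeline_output, key=lambda item: item["score"])
    return best["label"], best["score"]


text = build_input("Titular de la noticia", "Resumen de la noticia.")

# Hypothetical output, shaped like a pipeline result; not a real prediction.
fake_output = [{"label": "LABEL_1", "score": 0.91}]
label, score = top_prediction(fake_output)
print(text)
print(label, score)
```

Passing only a title to `build_input` corresponds to the "Title" setting in the evaluation table, while passing both arguments corresponds to "Title + Summary".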

Code of our experiments

Dataset transformation: https://github.com/PlanTL-GOB-ES/controversy-detection-model/tree/main/src/dataset

Model training: https://github.com/PlanTL-GOB-ES/controversy-detection-model/tree/main/src/model_training

Statistics and results analysis: https://github.com/PlanTL-GOB-ES/controversy-detection-model/tree/main/src/statistics

Cite 📣

@article{PLN6484,
	author = {Blanca Calvo Figueras and Asier Gutiérrez-Fandiño and Marta Villegas},
	title = {Anticipating the Debate: Predicting Controversy in News with Transformer-based NLP},
	journal = {Procesamiento del Lenguaje Natural},
	volume = {70},
	number = {0},
	year = {2023},
	keywords = {},
	abstract = {Controversy is a social phenomenon that emerges when a topic generates large disagreement among people. In the public sphere, controversy is very often related to news. Whereas previous approaches have addressed controversy detection, in this work, we propose to predict controversy based on the title and content of a news post. First, we collect and prepare a dataset from a Spanish news aggregator that labels the news’ controversy in a community-based manner. Next, we experiment with the capabilities of language models to learn these labels by fine-tuning models that take both title and content, and the title alone. To cope with data unbalance, we undergo different experiments by sampling the dataset. The best model obtains an 84.72 micro-F1, trained with an unbalanced dataset and given the title and content as input. The preliminary results show that this task can be learned by relying on linguistic and social features.},
	issn = {1989-7553},
	url = {http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6484},
	pages = {123--133}
}

Contact 📧

For questions regarding this work, contact bcalvo.bsc@gmail.com

Disclaimer

The models published in this repository are intended for a generalist purpose and are available to third parties. These models may have bias and/or any other undesirable distortions.

When third parties deploy or provide systems and/or services to other parties using any of these models (or using systems based on these models) or become users of the models, they should note that it is their responsibility to mitigate the risks arising from their use and, in any event, to comply with applicable regulations, including regulations regarding the use of artificial intelligence.

In no event shall the owner of the models (SEDIA – State Secretariat for Digitalization and Artificial Intelligence) nor the creator (BSC – Barcelona Supercomputing Center) be liable for any results arising from the use made by third parties of these models.

