This repository contains the code of the paper "Anticipating the Debate: Predicting Controversy in News with Transformer-based NLP"

PlanTL-GOB-ES/controversy-detection-model

Spanish Controversy Detection Language Model

This repository contains the code of the paper "Anticipating the Debate: Predicting Controversy in News with Transformer-based NLP".

Controversy is a social phenomenon that emerges when a topic generates strong disagreement among people. In the public sphere, controversy is very often tied to news. Whereas previous approaches have addressed controversy detection, in this work we propose to predict controversy from the title and content of a news post. First, we collect and prepare a dataset from a Spanish news aggregator whose community labels news items as controversial. Next, we test how well language models learn these labels by fine-tuning models that take either the title and content or the title alone. To cope with class imbalance, we run experiments on different samplings of the dataset. The best model obtains 84.72% micro-F1; it was trained on the unbalanced dataset with title and content as input. These preliminary results show that this task can be learned by relying on linguistic and social features.

Model 🤖

We use a dataset of news from the Menéame platform, whose community tags each item with a controversy label. The best model was trained with a batch size of 4 and a learning rate of 1e-5 for 5 epochs. We selected the best checkpoint by its score on the downstream task metric on the corresponding development set and then evaluated that checkpoint on the test set.
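
The training setup above can be sketched in plain Python. This is only an illustration: it assumes the best model was trained on the 18,270-instance unbalanced training set with one optimizer step per batch (no gradient accumulation), and the development scores and checkpoint names are hypothetical placeholders, not results from the paper.

```python
import math
from typing import Dict

# Hyperparameters stated in this README.
BATCH_SIZE = 4
LEARNING_RATE = 1e-5
EPOCHS = 5

# Assumption: the best model saw the 18,270-instance unbalanced training set.
TRAIN_INSTANCES = 18_270


def optimizer_steps(n_instances: int, batch_size: int, epochs: int) -> int:
    """Total optimizer steps, assuming one step per batch and no accumulation."""
    return math.ceil(n_instances / batch_size) * epochs


def select_best_checkpoint(dev_scores: Dict[str, float]) -> str:
    """Return the checkpoint name with the highest development-set score."""
    return max(dev_scores, key=dev_scores.get)


total_steps = optimizer_steps(TRAIN_INSTANCES, BATCH_SIZE, EPOCHS)

# Hypothetical development-set scores, one per epoch; not figures from the paper.
dev_scores = {
    "epoch-1": 0.79,
    "epoch-2": 0.82,
    "epoch-3": 0.84,
    "epoch-4": 0.83,
    "epoch-5": 0.83,
}
best = select_best_checkpoint(dev_scores)
print(total_steps, best)
```

Only the selected checkpoint is then scored on the test set, so the test data plays no part in model selection.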

Hugging Face: https://huggingface.co/PlanTL-GOB-ES/Controversy-Prediction

Dataset 🗂️

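| Collection | Set | Instances |
|---|---|---|
| All | Controversial | 5,584 |
| All | Non controversial | 231,385 |
| Unbalanced Subset from All | Train | 18,270 |
| Unbalanced Subset from All | Development | 1,058 |
| Unbalanced Subset from All | Test | 1,058 |
| Balanced Subset from All | Train | 9,900 |
| Balanced Subset from All | Development | 634 |
| Balanced Subset from All | Test | 1,058 |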
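
To make the class imbalance concrete, the sketch below computes the share of controversial items in the full collection (roughly 2.4%) and shows one common way to build a balanced sample, random undersampling of the majority class. The `undersample` helper and the `title`/`label` field names are hypothetical; the README does not say which sampling procedure the authors used.

```python
import random

# Counts from the "All" collection.
CONTROVERSIAL = 5_584
NON_CONTROVERSIAL = 231_385

controversial_share = CONTROVERSIAL / (CONTROVERSIAL + NON_CONTROVERSIAL)


def undersample(examples, majority_label, ratio=1.0, seed=0):
    """Randomly drop majority-class examples so that at most `ratio`
    majority examples remain per minority example."""
    rng = random.Random(seed)
    majority = [ex for ex in examples if ex["label"] == majority_label]
    minority = [ex for ex in examples if ex["label"] != majority_label]
    keep = min(len(majority), int(len(minority) * ratio))
    sampled = rng.sample(majority, keep) + minority
    rng.shuffle(sampled)
    return sampled


# Toy data with hypothetical field names: 2 controversial items out of 20.
toy = [{"title": f"news {i}", "label": int(i < 2)} for i in range(20)]
balanced = undersample(toy, majority_label=0)
print(f"{controversial_share:.2%}", len(balanced))
```

A `ratio` above 1.0 keeps more non-controversial items, which gives a partially rebalanced sample.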

Evaluation ✅

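| Dataset | Training Setting | F1 | Accuracy | Time (s) |
|---|---|---|---|---|
| Balanced | Title | 0.7026 | 0.6295 | 1653 |
| Balanced | Title + Summary | 0.8093 | 0.7353 | 1267 |
| Unbalanced | Title | 0.8197 | 0.7268 | 2631 |
| Unbalanced | Title + Summary | 0.8472 | 0.7662 | 2615 |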
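
As a small worked comparison, the snippet below copies the figures from the table and computes how much F1 improves when the summary is added to the title. The variable and function names are illustrative only.

```python
# (dataset, training setting) -> (F1, accuracy, training time in seconds),
# copied from the evaluation table.
RESULTS = {
    ("Balanced", "Title"): (0.7026, 0.6295, 1653),
    ("Balanced", "Title + Summary"): (0.8093, 0.7353, 1267),
    ("Unbalanced", "Title"): (0.8197, 0.7268, 2631),
    ("Unbalanced", "Title + Summary"): (0.8472, 0.7662, 2615),
}


def f1_gain_from_summary(dataset: str) -> float:
    """F1 difference between the Title + Summary and Title-only settings."""
    title_f1 = RESULTS[(dataset, "Title")][0]
    full_f1 = RESULTS[(dataset, "Title + Summary")][0]
    return round(full_f1 - title_f1, 4)


# Setting with the highest F1 in the table.
best_setting = max(RESULTS, key=lambda key: RESULTS[key][0])
gains = {name: f1_gain_from_summary(name) for name in ("Balanced", "Unbalanced")}
print(best_setting, gains)
```

In both datasets the summary helps, and the gain is larger on the balanced subset than on the unbalanced one.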

Usage example of the model ⚗️

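from pprint import pprint

from transformers import pipeline

# Load the fine-tuned classifier from the Hugging Face Hub.
nlp = pipeline("text-classification", model="PlanTL-GOB-ES/Controversy-Prediction")
# A news title and its summary in a single string, separated by " -- ".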
example = "Esposas, hijos, nueras y familiares de altos cargos del PP y de la cúpula universitaria llenan la URJC -- Pedro González-Trevijano, rector de la universidad desde 2002 a 2013, ahora magistrado del Tribunal Constitucional, y su sucesor en el cargo, Fernando Suárez han tejido una red que ha dado cobijo laboral a más de un centenar de familiares de vicerrectores, gerentes o catedráticos en los cuatro campus con los que cuenta la universidad localizados en Alcorcón, Móstoles, Fuenlabrada y Vicálvaro."

# The pipeline returns a list with one {"label", "score"} dict per input text.
output = nlp(example)
pprint(output)
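
The example input above joins a title and a summary with " -- ". The helpers below sketch how such inputs might be built and how a pipeline result might be read. Both `build_input` and `top_prediction` are hypothetical; the separator is inferred from the example rather than confirmed as the training format, and the label name in the fake output is a placeholder.

```python
def build_input(title: str, summary: str = "", separator: str = " -- ") -> str:
    """Join title and summary in the 'title -- summary' shape of the example."""
    title, summary = title.strip(), summary.strip()
    return f"{title}{separator}{summary}" if summary else title


def top_prediction(pipeline_output):
    """Return (label, score) of the highest-scoring entry in a
    text-classification result, a list of {'label', 'score'} dicts."""
    best = max(pipeline_output, key=lambda item: item["score"])
    return best["label"], best["score"]


text = build_input("Titular de la noticia", "Resumen de la noticia.")

# Hypothetical pipeline output; the real label names come from the model config.
fake_output = [{"label": "LABEL_1", "score": 0.91}]
print(text, top_prediction(fake_output))
```

Calling `build_input` with a title only returns the bare title, which corresponds to the Title setting in the evaluation.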

Code of our experiments

Dataset transformation: https://github.com/PlanTL-GOB-ES/controversy-detection-model/tree/main/src/dataset

Model training: https://github.com/PlanTL-GOB-ES/controversy-detection-model/tree/main/src/model_training

Statistics and results analysis: https://github.com/PlanTL-GOB-ES/controversy-detection-model/tree/main/src/statistics

Cite 📣

@article{PLN6484,
	author = {Blanca Calvo Figueras and Asier Gutiérrez-Fandiño and Marta Villegas},
	title = {Anticipating the Debate: Predicting Controversy in News with Transformer-based NLP},
	journal = {Procesamiento del Lenguaje Natural},
	volume = {70},
	number = {0},
	year = {2023},
	keywords = {},
	abstract = {Controversy is a social phenomenon that emerges when a topic generates large disagreement among people. In the public sphere, controversy is very often related to news. Whereas previous approaches have addressed controversy detection, in this work, we propose to predict controversy based on the title and content of a news post. First, we collect and prepare a dataset from a Spanish news aggregator that labels the news’ controversy in a community-based manner. Next, we experiment with the capabilities of language models to learn these labels by fine-tuning models that take both title and content, and the title alone. To cope with data unbalance, we undergo different experiments by sampling the dataset. The best model obtains an 84.72 micro-F1, trained with an unbalanced dataset and given the title and content as input. The preliminary results show that this task can be learned by relying on linguistic and social features.},
	issn = {1989-7553},
	url = {http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6484},
	pages = {123--133}
}

Contact 📧

For questions regarding this work, contact bcalvo.bsc@gmail.com

Disclaimer

The models published in this repository are intended for general-purpose use and are available to third parties. These models may contain biases and/or other undesirable distortions.

When third parties deploy or provide systems and/or services to other parties using any of these models (or using systems based on these models), or become users of the models, they should note that it is their responsibility to mitigate the risks arising from their use and, in any event, to comply with applicable regulations, including regulations regarding the use of artificial intelligence.

In no event shall the owner of the models (SEDIA – State Secretariat for Digitalization and Artificial Intelligence) or the creator (BSC – Barcelona Supercomputing Center) be liable for any results arising from the use made by third parties of these models.

