Predicting Newsworthiness

This repository contains the data needed to replicate the findings of our article, From Crowd Ratings to Predictive Models of Newsworthiness to Support Science Journalism, published in the Proceedings of the ACM on Human-Computer Interaction and presented at CSCW 2022.

`train.json`

Crowdsourced dataset of ratings for the news values of different arXiv articles (n=500). Used to train the Extra Trees model. Please refer to Section 5.1 of our paper for details about model training with this data. Contains the following fields:

arxiv_id: Unique identifiers for arXiv articles, sourced from arXiv API.
arxiv_url: URLs for arXiv articles, sourced from arXiv API.
title: Titles for arXiv articles, sourced from arXiv API.
summary: Abstracts for arXiv articles, sourced from arXiv API.
published: Date of publication for arXiv articles, sourced from arXiv API.
authors: Authors for arXiv articles, sourced from arXiv API.
arxiv_primary_category: Author-provided primary category for arXiv articles, sourced from arXiv API.
readability: Readability score for the article's summary field, assigned by De-Jargonizer and scaled to the range 0-1.
actuality: Score for actuality news value, assigned by MTurk crowdworkers, range 1-5.
controversy: Score for controversy news value, assigned by MTurk crowdworkers, range 1-5.
relevance_magnitude: Score for relevance_magnitude news value, assigned by MTurk crowdworkers, range 1-5.
relevance_valence: Score for relevance_valence news value, assigned by MTurk crowdworkers, range 1-5.
newsworthiness_crowd_sum: Average of the four news values (actuality, controversy, relevance_magnitude, relevance_valence), range 1-5. Binarized at the value of 3 for training the newsworthiness classification model.
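
As a rough illustration of how the crowd score and its binarized label relate to the four news values, here is a minimal Python sketch. It is not the code used in the paper. The helper names `crowd_score` and `binarize` are hypothetical, the example record is made up, and treating a score of exactly 3 as newsworthy (`>=`) is an assumption, since the description above only says the score is binarized at 3.

```python
from statistics import mean

# The four crowd-rated news values described above (each on a 1-5 scale).
NEWS_VALUES = (
    "actuality",
    "controversy",
    "relevance_magnitude",
    "relevance_valence",
)


def crowd_score(record):
    """Average the four news values of one record (hypothetical helper)."""
    return mean(record[name] for name in NEWS_VALUES)


def binarize(score, threshold=3.0):
    """Turn a 1-5 score into a 0/1 label.

    Assumption: scores at or above the threshold count as newsworthy.
    The README does not say on which side a score of exactly 3 falls.
    """
    return int(score >= threshold)


# Made-up record for illustration only; not taken from train.json.
example = {
    "actuality": 4.0,
    "controversy": 2.0,
    "relevance_magnitude": 3.5,
    "relevance_valence": 3.0,
}

score = crowd_score(example)
label = binarize(score)
print(score, label)
```

If the stored `newsworthiness_crowd_sum` values are used directly, only the `binarize` step would be needed.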

`validate.json`

Crowdsourced dataset of ratings for the news values of different arXiv articles (n=55). Also contains expert evaluations of newsworthiness for this data. Used to evaluate the Extra Trees model. Please refer to Section 5.2 of our paper for details on findings.

In addition to the fields found in `train.json`, this data also contains the following:

nw_expert1: Score for newsworthiness assigned by expert 1, range 1-5.
nw_expert2: Score for newsworthiness assigned by expert 2, range 1-5.
newsworthiness_expert: Average of both experts' ratings for newsworthiness, range 1-5.
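
The expert score can be reproduced from the two individual ratings. The sketch below is illustrative only: `expert_score` is a hypothetical helper and the record is made up rather than taken from `validate.json`.

```python
def expert_score(record):
    """Average the two expert ratings of one record (hypothetical helper)."""
    return (record["nw_expert1"] + record["nw_expert2"]) / 2


# Made-up record for illustration only; not taken from validate.json.
example_expert = {"nw_expert1": 4, "nw_expert2": 3}

print(expert_score(example_expert))
```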
