
altmetric-news-quality

Reproduction material and instructions

Methodology

RSS collection: March 1 - May 3

Twitter collection: March 4

News metadata collection: March 5

A log of all decisions made before and during the data collection process can be found in the Wiki.

Raw Data Collection

Eight publications were selected and collected through one of two distribution channels: RSS for those that maintain functioning feeds, and Twitter for the rest.

| Publication | URL | Channel | Details |
| --- | --- | --- | --- |
| New York Times – Science | https://www.nytimes.com/section/science | RSS | https://rss.nytimes.com/services/xml/rss/nyt/Science.xml |
| The Guardian – Science | https://www.theguardian.com/science | RSS | https://www.theguardian.com/science/rss |
| Wired – Science | https://www.wired.com/category/science/ | RSS | https://www.wired.com/feed/category/science/latest/rss |
| Popular Science | https://www.popsci.com/ | Twitter | https://twitter.com/PopSci |
| IFLScience | https://www.iflscience.com/ | Twitter | https://twitter.com/IFLScience |
| HealthDay | https://consumer.healthday.com | RSS | https://consumer.healthday.com/feeds/feed.rss |
| News Medical | https://www.news-medical.net/ | RSS | http://www.news-medical.net/syndication.axd?format=rss |
| MedPageToday | https://www.medpagetoday.com | RSS | https://www.medpagetoday.com/rss/headlines.xml |

RSS Feeds

scripts/collect_feeds_and_sync.py

The following script was run as a cron job on our scholcommlab server during the collection date range. feedparser is used to retrieve and parse the feeds, as the library handles the various RSS versions. The feed URLs are defined in data/input/rss_feeds.csv.
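
A minimal sketch of what the cron job does, assuming rss_feeds.csv has name and url columns and that raw items are appended to a CSV (both the column names and the output path are assumptions, not taken from the repository):

```python
# Sketch of the feed collection step; column names and output path are assumed.
from datetime import datetime, timezone
from pathlib import Path

import feedparser
import pandas as pd

feeds = pd.read_csv("data/input/rss_feeds.csv")  # assumed columns: name, url
collected_at = datetime.now(timezone.utc).isoformat()

rows = []
for _, feed in feeds.iterrows():
    parsed = feedparser.parse(feed["url"])  # feedparser copes with different RSS/Atom versions
    for entry in parsed.entries:
        rows.append({
            "source": feed["name"],
            "title": entry.get("title"),
            "link": entry.get("link"),
            "published": entry.get("published"),
            "collected_at": collected_at,
        })

# Append today's snapshot; duplicates across runs are removed later in preprocessing.
out_path = Path("data/raw/rss_items.csv")  # assumed output location
pd.DataFrame(rows).to_csv(out_path, mode="a", header=not out_path.exists(), index=False)
```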

Twitter Feeds

notebooks/1_download_twitter.ipynb

This notebook can be used at any time to collect all tweets from the publications specified in data/input/twitter_feeds.csv.
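
A rough sketch of the collection step, here written with tweepy; the notebook may use a different Twitter client, and the credential placeholders, the screen_name column in twitter_feeds.csv, and the output path are assumptions:

```python
# Sketch only: the actual notebook may use a different library or data layout.
import pandas as pd
import tweepy

auth = tweepy.OAuthHandler("API_KEY", "API_SECRET")      # placeholder credentials
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)

accounts = pd.read_csv("data/input/twitter_feeds.csv")   # assumed column: screen_name

rows = []
for screen_name in accounts["screen_name"]:
    # Page through the user timeline (the API returns roughly the 3,200 most recent tweets).
    for tweet in tweepy.Cursor(api.user_timeline,
                               screen_name=screen_name,
                               tweet_mode="extended").items():
        rows.append({
            "source": screen_name,
            "tweet_id": tweet.id_str,
            "created_at": tweet.created_at,
            "text": tweet.full_text,
            "urls": [u["expanded_url"] for u in tweet.entities.get("urls", [])],
        })

pd.DataFrame(rows).to_csv("data/raw/tweets.csv", index=False)  # assumed output location
```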

Preprocess URLs from RSS/Twitter

notebooks/2_process_channels.ipynb

This notebook cleans the output of each of the two collection channels (e.g., removing duplicate items and tweets without links) and creates two spreadsheets, one per channel.
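
In pandas terms, the cleaning amounts to something like the following (column names and file paths are assumptions):

```python
# Sketch of the channel preprocessing; column names and file paths are assumed.
import pandas as pd

# RSS channel: the cron job collects the same items repeatedly, so drop duplicates by link.
rss = pd.read_csv("data/raw/rss_items.csv")
rss.drop_duplicates(subset=["link"]).to_csv("data/processed/rss_clean.csv", index=False)

# Twitter channel: drop duplicate tweets and tweets that do not carry any link.
tweets = pd.read_csv("data/raw/tweets.csv")
tweets = tweets.drop_duplicates(subset=["tweet_id"])
tweets = tweets[tweets["urls"].notna() & (tweets["urls"] != "[]")]
tweets.to_csv("data/processed/tweets_clean.csv", index=False)
```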

Scrape news articles

notebooks/3_scrape_articles.ipynb

This notebook uses the previously created cleaned files to build a main spreadsheet with all collected URLs of news articles. Using a combination of meta tags on the publishers' pages, custom HTML parsers adjusted to individual sources, and NLP processing, we collect a publication date, section information, keywords, author information, and a title for every article.
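
At its core, the per-article extraction with newspaper3k looks roughly like this (a sketch, not the notebook's exact code); the source-specific HTML parsers and meta-tag lookups then fill in what newspaper cannot extract reliably:

```python
# Sketch of the metadata extraction for a single article URL.
from newspaper import Article

def scrape(url: str) -> dict:
    article = Article(url)
    article.download()
    article.parse()   # fills title, authors, publish_date, meta_data, text, ...
    article.nlp()     # derives keywords from the article body (used where no tags exist)
    return {
        "url": url,
        "title": article.title,
        "published": article.publish_date,
        "authors": article.authors,
        "keywords": article.keywords,   # NLP-derived keywords
        "meta": article.meta_data,      # raw meta tags, used by the source-specific parsers
    }
```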

There are some caveats to consider for each field, due to the limitations and challenges of comparing classifications that different news sources use in different ways. While some sources attempt to implement best practices for meta tags, others only provide the bare minimum. Therefore, a few remarks on each collected field:

  • published: It is unclear how "published date" is used by some sources (published vs last-updated).
  • modified: Not available for three sources.
  • section: Difficult to compare across sources, as each source uses its classifications differently. While some are quite comparable (guardian, nyt, wired), in the case of newsmed the two available sections derive from its particular model of publication (i.e., "Medical News" & "Life Science News").
  • keywords: Similar challenges as with sections. For some sources, keywords were replaced by tags for lack of other keywords; again, each source might use tags and keywords differently in its own context. ifls and healthday did not provide any keywords or tags; however, using newspaper we could derive keywords from the text bodies.

| Source | published | modified | section | keywords | author |
| --- | --- | --- | --- | --- | --- |
| guardian | MD.article.published_time | MD.article.modified_time | MD.article.section | MD | authors |
| nyt | MD.article.published_time | MD.article.modified_time | MD.article.section | MD | authors |
| wired | html | --- | html | MD | MD.author |
| popsci | MD.article.published_time | MD.article.modified_time | html | html | authors |
| ifls | html | --- | html | nlp | html |
| newsmed | MD.article.published_time | MD.article.modified_time | html | MD | html |
| medpage | MD.dc.date | --- | MD.sailthru.topcat | MD | MD.sailthru.author |
| healthday | MD.article.published_time | MD.article.modified_time | MD.article.section | nlp | html |

Note: MD is metadata extracted from the page header by newspaper; html indicates content extracted from the page body; nlp indicates keyword extraction provided by newspaper; authors is a field extracted by newspaper's heuristics.
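
As an illustration of the MD.* notation: newspaper nests meta properties on the colon, so a tag like <meta property="article:published_time"> ends up at meta_data["article"]["published_time"]. The snippet below is illustrative only (the URL is hypothetical):

```python
# Illustrative only: reading "MD.*" fields from newspaper's parsed meta tags.
from newspaper import Article

article = Article("https://www.theguardian.com/science/example-article")  # hypothetical URL
article.download()
article.parse()

published = article.meta_data.get("article", {}).get("published_time")  # MD.article.published_time
section = article.meta_data.get("article", {}).get("section")           # MD.article.section
```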

Finally, this notebook creates an output file data/processes/articles.csv with the news articles published by all 8 sources during the collection period.

Postprocessing

notebooks/4_postprocessing.ipynb

The collected articles are then filtered based on a few exclusion criteria (see the sketch below):

Articles are excluded if:

  • they have been published before Mar 1 or after Apr 30
  • they are in Spanish
  • they have been used in previous samples
  • they were not successfully parsed by newspaper

The final dataset contains 5,172 articles with the following breakdown:

filtered_articles.png
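
A pandas sketch of these exclusion filters (the collection year, column names, and the previous-sample file are assumptions, not taken from the repository):

```python
# Sketch of the exclusion filters; column names, the year, and helper files are assumed.
import pandas as pd

articles = pd.read_csv("data/processes/articles.csv", parse_dates=["published"])
previous = set(pd.read_csv("data/input/previous_sample.csv")["url"])  # hypothetical file

YEAR = 2020  # collection year (assumed; not stated in this README)
start, end = pd.Timestamp(YEAR, 3, 1), pd.Timestamp(YEAR, 4, 30)

keep = (
    articles["published"].between(start, end)   # drop articles outside Mar 1 – Apr 30
    & (articles["language"] != "es")            # drop Spanish-language articles
    & ~articles["url"].isin(previous)           # drop articles used in previous samples
    & articles["title"].notna()                 # drop articles newspaper failed to parse
)
filtered = articles[keep]
```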

Reproduction

Set up the project

This project uses poetry to manage its dependencies. The recommended (and arguably easiest) way to get our code running:

  1. Get a copy of this repository on your machine and cd into the folder:
    1. `git clone git@github.com:ScholCommLab/altmetric-news-quality.git`
    2. `cd altmetric-news-quality`
  2. Install pyenv to manage your Python versions:
    1. `pyenv install 3.8.2`
    2. `pyenv local 3.8.2` to set your local Python version
  3. Install poetry to install dependencies and manage the local virtualenv:
    1. `poetry install`
    2. `poetry shell` to activate the virtualenv

Note: Make sure to check out the newspaper3k docs, as the installation might require some additional software on your system outside of the Python universe.

Another note: On our RHEL machine, I also had to ensure that libffi-devel (libffi-dev on Ubuntu/Debian) was installed and recompile the Python distribution (i.e., pyenv uninstall 3.8.2 and then pyenv install 3.8.2 again). Fun!

Lastly, feel free to use the requirements.txt to install the requirements as you wish :)