
Dataset of stories used for 2018 paper: Measuring News Similarity across ten U.S. Sites


grantat/news-similarity


News Similarity Data

This directory contains all the data collected and parsed during the WSDL group's news similarity project. We retrieved stories from the following websites:

The directories are described as follows:

  • The timemaps directory contains the timemap of each of the news sites.
  • The mementos directory contains, for each day from 2016-05-01 to 2017-05-31, the memento closest to 1 AM GMT, collected from the Internet Archive. Each subdirectory is named after its website's MD5 hash, listed in news-websites-hashes.json.
  • The stories/if_/ directory contains the news stories retrieved from the Internet Archive without banner/HTML injections.
  • The col_sim directory contains the per-day similarity calculations for the links, computed for k = 1, 3, and 10 stories from each news site.
  • The error directory contains files related to failed requests to the Internet Archive.
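
As an illustration of the naming scheme for the mementos directory, the sketch below computes an MD5 hex digest for a site URL and builds the matching directory path. The README does not say which exact string was hashed (for example, with or without the scheme or a trailing slash), so the input URL and the helper name `memento_dir` are assumptions; news-websites-hashes.json remains the authoritative mapping.

```python
import hashlib
from pathlib import Path


def memento_dir(site_url: str, root: str = "mementos") -> Path:
    """Return a candidate mementos subdirectory for a site URL.

    Assumption: the directory name is the MD5 hex digest of the URL
    string as given. Check the result against
    news-websites-hashes.json before relying on it.
    """
    digest = hashlib.md5(site_url.encode("utf-8")).hexdigest()
    return Path(root) / digest


if __name__ == "__main__":
    # Hypothetical input; the real site URLs are listed in the JSON file.
    print(memento_dir("http://example.com/"))
```

If the computed digest does not appear in news-websites-hashes.json, the hashed string most likely differs in form from the one used here.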

Aside from those directories, there are also some JSON and CSV files that hold subsets of the data used in other parts of this project, usually as summaries. For example, links_per_day.json lists the links used each day to compute the similarity for k = 10 stories from each news site. The col_sim directories also contain summary files of the similarity values, named col_sim_summary.csv.
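
The summary files can be collected with a few lines of Python. This is a minimal sketch, assuming each col_sim_summary.csv starts with a header row; the column names are not documented here, so the code keeps whatever headers it finds. The function name `load_summaries` is hypothetical.

```python
import csv
from pathlib import Path


def load_summaries(root="col_sim"):
    """Map each col_sim_summary.csv found under root to its rows.

    Assumption: every summary file has a header row, so each row is
    returned as a dict keyed by the file's own column names.
    """
    summaries = {}
    for path in sorted(Path(root).rglob("col_sim_summary.csv")):
        with path.open(newline="", encoding="utf-8") as fh:
            summaries[str(path)] = list(csv.DictReader(fh))
    return summaries


if __name__ == "__main__":
    for name, rows in load_summaries().items():
        print(name, len(rows), "rows")
```

Searching recursively avoids assuming how the col_sim directories are nested.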
