News Similarity Data

This directory contains all the data collected and parsed throughout the WSDL group's news similarity project. The news websites we retrieved stories from are listed in news-websites.txt.

The directories are described as follows:

The timemaps directory contains the timemap of each of the news sites.
The mementos directory contains the mementos closest to 1 AM GMT for every day from 2016-05-01 to 2017-05-31, collected from the Internet Archive. Its subdirectories are named after each website's MD5 hash, which is listed in news-websites-hashes.json.
The stories/if_/ directory contains the news stories retrieved from the Internet Archive without banner/HTML injections.
The col_sim directory contains the per-day similarity calculations for the links, for k = 1, 3, and 10 stories from each news site.
The errors directory contains files related to failed requests to the Internet Archive.
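
The naming and collection scheme described above can be sketched roughly as follows. This is only an illustration, not the project's actual collection script: the helper names (`site_dir_name`, `daily_targets`, `wayback_raw_url`) are hypothetical, and it assumes the MD5 hash is taken over the website string exactly as the project lists it. When in doubt, news-websites-hashes.json is the authoritative mapping. The `if_` suffix after the timestamp is the Wayback Machine URL modifier that returns a capture without the archive banner.

```python
import hashlib
from datetime import date, datetime, time, timedelta, timezone


def site_dir_name(site: str) -> str:
    """Return the MD5 hex digest of a site string.

    Assumption: the hash input is the site string exactly as listed by the
    project; check news-websites-hashes.json for the real mapping.
    """
    return hashlib.md5(site.encode("utf-8")).hexdigest()


def daily_targets(start: date, end: date):
    """Yield one 01:00 GMT datetime per day, including both end dates."""
    day = start
    while day <= end:
        yield datetime.combine(day, time(1, 0), tzinfo=timezone.utc)
        day += timedelta(days=1)


def wayback_raw_url(site: str, when: datetime) -> str:
    """Build a Wayback Machine URL for the capture nearest to `when`.

    The 14-digit timestamp plus `if_` requests the archived page without
    the injected banner.
    """
    timestamp = when.strftime("%Y%m%d%H%M%S")
    return f"https://web.archive.org/web/{timestamp}if_/{site}"


if __name__ == "__main__":
    targets = list(daily_targets(date(2016, 5, 1), date(2017, 5, 31)))
    print(len(targets), "daily targets")
    print(wayback_raw_url("http://example.com/", targets[0]))
```

Requesting such a URL normally redirects to the closest available capture, so the timestamp of the memento actually returned can differ from the 1 AM target.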

Aside from those directories, there are also some JSON and CSV files that hold subsets of the data for other parts of this project, usually as summaries. For example, links_per_day.json describes the links used per day to find the similarity for k = 10 stories from each news site. The col_sim directories also contain files named col_sim_summary.csv that summarize the similarity values.
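
A minimal sketch for locating and loading the summary files, assuming the repository is checked out locally. The column layout of col_sim_summary.csv is not documented here, so the sketch only assumes that each file starts with a header row and returns every row as a dictionary keyed by whatever column names the file declares. The function names `find_summaries` and `read_rows` are hypothetical.

```python
import csv
from pathlib import Path


def find_summaries(root):
    """Return every col_sim_summary.csv found below `root`, sorted by path."""
    return sorted(Path(root).rglob("col_sim_summary.csv"))


def read_rows(csv_path):
    """Read a CSV file into a list of dicts.

    Assumption: the first line of the file is a header row.
    """
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


if __name__ == "__main__":
    for summary in find_summaries("col_sim"):
        rows = read_rows(summary)
        print(summary, len(rows), "rows")
```

Files such as links_per_day.json can be loaded the same way with `json.load`, after which the structure can be inspected before any processing is written against it.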

