Scripts to scrape large German news websites, and access to the data (1.1 million German language news articles from 01/2020 to 12/2022).

Scraping German News Websites

Scripts to scrape large German news websites, and the resulting data set of just over one million German news articles from 01.01.2020 to 31.12.2022. To get the code, simply run

git clone https://github.com/kssrr/german-media-scrape

If you are unfamiliar with git, you can instead copy, paste, and run the setup.R script, which will also install the dependencies for you.

Getting the data

Direct download (compressed .tar.gz)

We assembled a demo data set that includes all articles published between January 1st, 2020 and December 31st, 2022 by the media outlets taz, Zeit, Süddeutsche, Spiegel, and Welt. The data set contains a little over one million German-language news articles of varying length (~3.5 GB uncompressed). Article titles are missing for some sites due to an earlier problem with the scrapes; we plan to add them in later versions. The data is hosted here.
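Once the compressed archive has been downloaded, it needs to be unpacked before use. The following is a minimal Python sketch of that step, not part of the repository's own scripts. The function name `extract_archive` is hypothetical, and nothing is assumed about the archive's file name or internal layout.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest_dir):
    """Unpack a .tar.gz archive into dest_dir and return the extracted files."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as archive:
        # Use the safer "data" extraction filter where this Python version has it.
        if hasattr(tarfile, "data_filter"):
            archive.extractall(dest, filter="data")
        else:
            archive.extractall(dest)
    # List every regular file found under the destination directory.
    return sorted(p for p in dest.rglob("*") if p.is_file())
```

Call it with the path of the downloaded file and a target directory of your choice; keep in mind that the unpacked data takes about 3.5 GB of disk space.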

The data set includes broad coverage of various impactful events that could be fruitfully analysed, like the German federal election 2021, COVID-19, the 2022 Soccer World Cup, and of course the Russian invasion of Ukraine in early 2022.

In principle, the scripts can also scrape articles going back as far as the newspapers' archives allow; simply change the code near the top of each script where the dates (years) to scrape are specified.

Example usage

An elaborate example (topic modelling) is shown here, but you can also do a lot of interesting, more basic exploratory analysis with this kind of data, for example, examining reporting on political parties:

Reporting on political parties on two German news websites.
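A comparison of this kind can start from a simple count of articles that mention each party, grouped by outlet. The sketch below is an illustrative Python example rather than the code behind the figure. The record fields `outlet` and `text`, the party list, and the toy articles are all assumptions made for the example.

```python
import re
from collections import Counter

# Illustrative selection of party abbreviations; extend as needed.
PARTIES = ["SPD", "CDU", "FDP", "AfD"]


def count_party_mentions(articles, parties=PARTIES):
    """Count, per outlet, how many articles mention each party at least once."""
    # Word boundaries avoid matching the abbreviation inside longer tokens.
    patterns = {p: re.compile(rf"\b{re.escape(p)}\b") for p in parties}
    counts = {}
    for article in articles:
        outlet_counts = counts.setdefault(article["outlet"], Counter())
        for party, pattern in patterns.items():
            if pattern.search(article["text"]):
                outlet_counts[party] += 1
    return counts


# Toy records standing in for rows of the real data set (field names assumed).
articles = [
    {"outlet": "taz", "text": "Die SPD und die FDP verhandeln."},
    {"outlet": "taz", "text": "Die SPD stellt ihr Programm vor."},
    {"outlet": "Welt", "text": "CDU und SPD im Streit."},
]

mentions = count_party_mentions(articles)
for outlet, outlet_counts in mentions.items():
    print(outlet, dict(outlet_counts))
```

Counting articles instead of raw occurrences keeps a single long article from dominating the totals. Dividing by the number of articles per outlet would make outlets of different size comparable.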

You could also look at the salience of particular topics:

Media Attention on Ukraine after the 2022 invasion.

World Cup 2022
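One simple way to measure salience is the share of articles per month that mention a keyword. The following Python sketch illustrates the idea under stated assumptions: each record has an ISO-formatted `date` string and a `text` field, and both the function name `monthly_salience` and the toy data are hypothetical.

```python
from collections import defaultdict


def monthly_salience(articles, keyword):
    """Return {"YYYY-MM": share of that month's articles mentioning keyword}."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    needle = keyword.casefold()
    for article in articles:
        month = article["date"][:7]  # "2022-03-01" -> "2022-03"
        totals[month] += 1
        # Case-insensitive substring match, so German compounds also count.
        if needle in article["text"].casefold():
            hits[month] += 1
    return {month: hits[month] / totals[month] for month in sorted(totals)}


# Toy records standing in for rows of the real data set (field names assumed).
articles = [
    {"date": "2022-01-10", "text": "Spannungen an der Grenze zur Ukraine."},
    {"date": "2022-01-12", "text": "Neue Corona-Regeln beschlossen."},
    {"date": "2022-03-01", "text": "Krieg in der Ukraine dauert an."},
    {"date": "2022-03-02", "text": "Sanktionen wegen des Ukrainekriegs."},
]

salience = monthly_salience(articles, "Ukraine")
print(salience)
```

Substring matching is a deliberate choice here because German forms compounds such as "Ukrainekrieg"; for short keywords it can produce false positives, in which case a word-boundary regular expression is the safer option.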

Or investigate pairwise correlation clusters of keywords (see here for the methodology):

Network

welt_network
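As a rough illustration of what a pairwise keyword correlation can look like, the Python sketch below computes the phi coefficient between keywords based on whether they appear in the same article. This is an assumption about one common way to do it, not a reproduction of the linked methodology; the function names and the toy documents are hypothetical.

```python
import math
import re
from itertools import combinations


def phi(n11, n10, n01, n00):
    """Phi coefficient of a 2x2 table of keyword co-occurrence counts."""
    denom = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    return 0.0 if denom == 0 else (n11 * n00 - n10 * n01) / denom


def pairwise_keyword_correlation(documents, keywords):
    """Return {(a, b): phi} for every pair of keywords across the documents."""
    # Reduce each document to its set of lowercased word tokens.
    doc_words = [set(re.findall(r"\w+", d.casefold())) for d in documents]
    result = {}
    for a, b in combinations(keywords, 2):
        n11 = n10 = n01 = n00 = 0
        for words in doc_words:
            has_a, has_b = a.casefold() in words, b.casefold() in words
            if has_a and has_b:
                n11 += 1
            elif has_a:
                n10 += 1
            elif has_b:
                n01 += 1
            else:
                n00 += 1
        result[(a, b)] = phi(n11, n10, n01, n00)
    return result


# Toy documents standing in for article texts.
documents = [
    "Putin, Krieg, Ukraine",
    "Krieg in der Ukraine: neue Sanktionen",
    "Fussball-WM in Katar",
    "WM in Katar: Kritik an der FIFA",
]

correlations = pairwise_keyword_correlation(documents, ["ukraine", "krieg", "katar"])
for pair, value in correlations.items():
    print(pair, round(value, 2))
```

Pairs with a correlation above some threshold can then be drawn as edges of a network graph, which is one way to arrive at clusters like those in the figure.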

Special thanks to the University of Münster for providing us with additional computational resources for this project.
