A scraper that extracts the article text from the Dutch news website nu.nl, used to train a MITIE NER model

laurentmih/nunl_scraper

Another episode in the series of "repurposing" a public API!

Introduction

I needed "some" text data to train MITIE and create a proper word feature extractor for the Dutch language. I figured I would use some form of modern Dutch to get started, and NU.nl seemed like an excellent first target. Opening up the Chrome console like a true hacker, it took me all of 3 minutes to find their (apparently not rate-limited) API. Hence, a quick and dirty script.

How it works

Honestly, just read the source, it's pretty straightforward. Generally:

  • Requests all the article URIs from the API, sleeping between requests so as not to completely hammer it
  • Stores them in a text file
  • Once it reaches the end of the listing, loops through that text file of links and downloads each article
  • Slams the extracted text into a text file
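
If you'd rather see the shape of it before opening the source, here is a minimal sketch of those two phases. It is not the actual script: `collect_links`, `download_articles`, `fetch_page` and `fetch_text` are hypothetical names, the HTTP calls are left to callables you pass in (the real API endpoint is deliberately not reproduced here), and "an empty page means the end" plus "one article per line" are assumptions.

```python
import time
from pathlib import Path
from typing import Callable, List


def collect_links(fetch_page: Callable[[int], List[str]],
                  links_file: Path,
                  delay: float = 1.0) -> int:
    """Phase 1: page through the listing and append every article URI to links_file.

    fetch_page is a hypothetical callable that returns the URIs on a given page.
    """
    page, total = 0, 0
    with links_file.open("a", encoding="utf-8") as out:
        while True:
            uris = fetch_page(page)
            if not uris:  # assumption: an empty page marks the end of the listing
                break
            for uri in uris:
                out.write(uri + "\n")
            total += len(uris)
            page += 1
            time.sleep(delay)  # be polite, do not hammer the API
    return total


def download_articles(fetch_text: Callable[[str], str],
                      links_file: Path,
                      corpus_file: Path,
                      delay: float = 1.0) -> int:
    """Phase 2: walk the links file and append each article's text to corpus_file.

    fetch_text is a hypothetical callable that returns the extracted text for a URI.
    """
    count = 0
    with links_file.open(encoding="utf-8") as links, \
            corpus_file.open("a", encoding="utf-8") as out:
        for line in links:
            uri = line.strip()
            if not uri:
                continue
            # assumption: collapse whitespace so each article sits on one line
            out.write(" ".join(fetch_text(uri).split()) + "\n")
            count += 1
            time.sleep(delay)
    return count
```

Passing the fetchers in as arguments keeps the file handling testable without touching the network; in practice they would wrap whatever HTTP client and text extraction you prefer.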

You should've used a database

I know, I know, I don't know what I was thinking, but remember I said 'quick and dirty'?
