Introduction

Another episode in the series of "repurposing" a public API!

I needed "some" text data to train MITIE and create a proper word feature extractor for the Dutch language. I figured I would start with some form of modern Dutch, and NU.nl seemed like an excellent first target. Opening up the Chrome console like a true hacker, it took me all of three minutes to find their (apparently not rate-limited) API. Hence this quick and dirty script.

How it works

Honestly, just read the source; it's pretty straightforward. In broad strokes:

Requests all the article URIs from the API, sleeping between requests so as not to completely hammer it
Stores the URIs in a text file
Once it reaches the end of the listing, loops through the links text file and downloads each article
Dumps the extracted text into a text file
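
The steps above could be sketched roughly as follows. This is not the code from main.py: the two fetcher callables (`fetch_page`, `fetch_text`), the file paths, and the "stop on an empty page" condition are assumptions made for illustration, since the API's actual endpoints and response format aren't described here.

```python
import time
from pathlib import Path
from typing import Callable, List, Optional


def collect_links(
    fetch_page: Callable[[int], List[str]],  # hypothetical: returns the article URIs on one page
    links_file: Path,
    delay: float = 1.0,
) -> int:
    """Page through the listing and append every article URI to links_file."""
    page = 0
    total = 0
    with links_file.open("a", encoding="utf-8") as out:
        while True:
            uris = fetch_page(page)
            if not uris:  # assumption: an empty page means we reached the end
                break
            for uri in uris:
                out.write(uri + "\n")
            total += len(uris)
            page += 1
            time.sleep(delay)  # pause between requests to go easy on the API
    return total


def download_articles(
    fetch_text: Callable[[str], Optional[str]],  # hypothetical: returns the extracted article text
    links_file: Path,
    corpus_file: Path,
    delay: float = 1.0,
) -> int:
    """Read the saved URIs one by one and append each article's text to corpus_file."""
    count = 0
    with links_file.open(encoding="utf-8") as links, \
            corpus_file.open("a", encoding="utf-8") as out:
        for line in links:
            uri = line.strip()
            if not uri:
                continue
            text = fetch_text(uri)
            if text:  # skip articles that yielded no text
                out.write(text.strip() + "\n")
                count += 1
            time.sleep(delay)
    return count
```

In a real run, `fetch_page` and `fetch_text` would wrap the HTTP calls and text extraction; keeping them as parameters means the file handling can be tried out with stand-in functions first.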

You should've used a database

I know, I know. I don't know what I was thinking, but remember: I said 'quick and dirty'.

laurentmih/nunl_scraper
