This script can be used to build a local database of news articles by crawling RSS feeds and downloading the articles.
- It can be run periodically (e.g. via cron) to keep the database up to date.
- It stores the articles in a local folder, in a simple format that can be easily loaded into memory.
- It prevents multiple instances from running at the same time (a sketch of one way to do this is shown after this list).
- It employs a cache to avoid downloading the same articles multiple times.
- It uses simple methods to avoid being blacklisted by websites, though no guarantees are made.
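The script's own locking mechanism is not detailed here, but a minimal sketch of one way to enforce a single running instance, using a hypothetical `crawler.lock` file, looks like this:

```python
import os
import sys

LOCK_PATH = 'crawler.lock'  # hypothetical lock-file location, not the script's actual path

def run_exclusively(job):
    """Run job() only if no other instance currently holds the lock file."""
    try:
        # O_EXCL makes creation fail atomically if the file already exists
        fd = os.open(LOCK_PATH, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        sys.exit('Another instance is already running; exiting.')
    try:
        job()
    finally:
        os.close(fd)
        os.remove(LOCK_PATH)
```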
- Download `main.py` and `requirements.txt` into a folder. Preferably, create a new folder for this purpose.
- Install the required packages by running `pip install -r requirements.txt`.
- Create an `rss_urls.txt` file and enter the RSS feed URLs you want to crawl. Each URL should be on a new line (see the example after this list).
- Run `python main.py` to start the crawler, or create a cron job to run it periodically.
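An `rss_urls.txt` file might look like this (the feed URLs below are placeholders):

```
https://example.com/rss.xml
https://news.example.org/feed
```

And a crontab entry that runs the crawler at the top of every hour could look like this (the install path is an assumption):

```
0 * * * * cd /path/to/crawler && python main.py
```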
The crawler will create a `data` folder and store the crawled data in it. Articles are grouped by publication date and stored in files named `YYYY-MM-DD.pkl`.
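After a few runs, the folder will look something like this (assuming the `articles` subfolder used by the loading snippet below; the dates are examples):

```
data/
└── articles/
    ├── 2024-05-01.pkl
    ├── 2024-05-02.pkl
    └── ...
```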
The `.pkl` files can be loaded as follows:
```python
import os
import pickle

# Map each date string ('YYYY-MM-DD') to its list of articles
articles = {}
for file in os.listdir('data/articles'):
    with open(os.path.join('data/articles', file), 'rb') as f:
        articles[file.replace('.pkl', '')] = pickle.load(f)
```
Each entry in the `articles` dict uses the following format:
```python
'YYYY-MM-DD': [
    {
        'url': 'example.com/articles/1',
        'title': 'Lorem ipsum dolor sit amet',
        'body': 'Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.'
    },
    {
        'url': 'example.com/articles/2',
        'title': 'Lorem ipsum dolor sit amet',
        'body': 'Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.'
    },
    # ...
]
```
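As a quick usage example, here is one way to print the headlines for a single day (the date key is a placeholder; use any key present in `articles`):

```python
day = '2024-05-01'  # hypothetical date key
for article in articles.get(day, []):
    print(article['title'], '-', article['url'])
```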
You can control the logging level by setting the `LOG_LEVEL` environment variable. The default level is `WARNING`, and the valid values are the standard Python `logging` levels: `DEBUG`, `INFO`, `WARNING`, `ERROR`, and `CRITICAL`.
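For example, to run a single crawl with more verbose output:

```
LOG_LEVEL=INFO python main.py
```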