
Web Scraping for News Data Extraction

Extract news from hundreds of digital media outlets, create summaries, and synthesize content to democratize global information.

  1. Overview
  2. Running the Script
  3. Functions

Overview

This Python application extracts news data from pre-extracted news media section URLs stored in the file final_url_sections_vX.json (where X is an integer denoting the file version), filters and cleans the data, optionally summarizes the article bodies through the OpenAI GPT API, and stores the information in a structured database. The script is organized into several functions, each serving a specific purpose in the data extraction process.

The web scraping process is designed to run in a multi-process environment to improve efficiency.

This script can be a valuable tool for collecting and analyzing news data from various sources.

Running the Script

To execute the main process and view the extracted news on a locally hosted webpage backed by Django, refer to the helper file run-project-end-to-end.txt.

The script can be run to extract news data from various media sources. It processes a list of media URLs and their associated sections, extracts the news URLs, and stores the data in a structured database.

Please note that the script relies on external APIs, so you'll need to set up the necessary API key as an environment variable:

OPENAI_API_KEY=yourOpenAIAPIKey
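
For reference, the key can then be read inside Python from the environment (a minimal sketch; the script's actual configuration handling may differ):

    import os

    # Fail fast if the key is missing, instead of failing at the first API call.
    OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]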

Functions

extract_data_from_jsons

This function extracts news data from JSON objects within the HTML content of a news webpage. It looks for various attributes like the article body, article type, creation date, and more. The data is stored in a dictionary and returned.
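As an illustration of the general technique, a minimal sketch follows, assuming the embedded data follows the common JSON-LD convention (script tags of type application/ld+json); the exact fields and error handling in the repository may differ:

    import json
    from bs4 import BeautifulSoup

    def extract_data_from_jsons_sketch(html: str) -> dict:
        """Collect article fields from <script type="application/ld+json"> blocks."""
        data = {}
        soup = BeautifulSoup(html, "html.parser")
        for script in soup.find_all("script", type="application/ld+json"):
            try:
                payload = json.loads(script.string or "")
            except json.JSONDecodeError:
                continue  # skip malformed JSON blocks
            # Some pages wrap several JSON-LD objects in a single list.
            for obj in payload if isinstance(payload, list) else [payload]:
                if isinstance(obj, dict):
                    data.setdefault("article_type", obj.get("@type"))
                    data.setdefault("article_body", obj.get("articleBody"))
                    data.setdefault("creation_date", obj.get("datePublished"))
        return data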

extract_data_from_metadata

This function extracts additional data from meta tags in the HTML of the news webpage. It retrieves information like the title, description, creation date, and more. The extracted data is added to the dictionary passed to the function.
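The general pattern might look like the sketch below; the property names (og:title and friends) are common conventions and an assumption here, not necessarily the exact set the function reads:

    from bs4 import BeautifulSoup

    def extract_data_from_metadata_sketch(html: str, data: dict) -> dict:
        """Fill missing fields from <meta> tags such as Open Graph properties."""
        wanted = {
            "og:title": "title",
            "og:description": "description",
            "article:published_time": "creation_date",
        }
        soup = BeautifulSoup(html, "html.parser")
        for meta in soup.find_all("meta"):
            prop = meta.get("property") or meta.get("name")
            if prop in wanted:
                # Only fill keys the JSON extraction did not already provide.
                data.setdefault(wanted[prop], meta.get("content"))
        return data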

extract_keys_with_gpt

This function uses the GPT API to extract additional keys from the text content of the webpage. It looks for relevant information such as the number of tokens, title, tags, creation date, and more.
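A minimal sketch of such a call, assuming the openai>=1.0 Python SDK, an illustrative prompt, and gpt-3.5-turbo as a stand-in model name:

    import json
    from openai import OpenAI

    client = OpenAI()  # picks up OPENAI_API_KEY from the environment

    def extract_keys_with_gpt_sketch(text: str):
        """Ask the model for missing article fields; return (fields, token count)."""
        prompt = (
            "From the following news page text, return a JSON object with the "
            "keys title, tags and creation_date:\n\n" + text
        )
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}],
        )
        # Assumes the model answers with valid JSON; a robust version would retry.
        fields = json.loads(response.choices[0].message.content)
        return fields, response.usage.total_tokens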

find_body_with_gpt

This function extracts the main body of the news article using the GPT API. It processes the text content of the webpage and returns the article body and the number of tokens.
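Counting tokens locally is often done with tiktoken; that library is an assumption here, not something the repository confirms:

    import tiktoken

    def count_tokens_sketch(text: str, model: str = "gpt-3.5-turbo") -> int:
        """Approximate how many tokens the GPT API will see for this text."""
        encoding = tiktoken.encoding_for_model(model)
        return len(encoding.encode(text))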

get_body_summary

This function uses the GPT API to summarize the article body, providing a brief summary of the news content. It takes the text, URL, and media name as input and returns the summarized body and the number of tokens. Note that you can disable API calls by setting the constant BLOCK_API_CALL to True in the constants.py file, where application variables are stored.
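A minimal sketch of how such a guard can work; BLOCK_API_CALL and constants.py come from the description above, while the prompt, model name, and fallback behavior are illustrative assumptions:

    from constants import BLOCK_API_CALL  # application-wide switch, see above
    from openai import OpenAI

    client = OpenAI()

    def get_body_summary_sketch(text: str, url: str, media_name: str):
        """Return (summary, token count); skip the paid API call when blocked."""
        if BLOCK_API_CALL:
            return text, 0  # illustrative fallback: pass the body through untouched
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{
                "role": "user",
                "content": f"Summarize this article from {media_name} ({url}):\n\n{text}",
            }],
        )
        return response.choices[0].message.content, response.usage.total_tokens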

treat_raw_urls

This function processes a list of raw news URLs, filtering out invalid or undesirable URLs. It also interacts with a database to determine which URLs are already stored.
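The database check typically reduces to a set difference, as in this sketch:

    def treat_raw_urls_sketch(raw_urls, stored_urls):
        """Drop duplicates and anything the database already holds."""
        stored = set(stored_urls)          # O(1) membership tests
        deduped = dict.fromkeys(raw_urls)  # removes duplicates, keeps first-seen order
        return [url for url in deduped if url not in stored]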

find_valid_urls

This function filters out invalid URLs based on various criteria, including query symbols, file extensions, and specific URL patterns. It also checks if URLs are already present in the database.
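A sketch of that kind of filter follows; the reject rules shown (extensions, path patterns) are illustrative stand-ins for the repository's actual criteria:

    import re

    UNDESIRED_EXTENSIONS = (".jpg", ".png", ".pdf", ".xml")   # illustrative
    UNDESIRED_PATTERNS = re.compile(r"/(tag|author|video)/")  # illustrative

    def find_valid_urls_sketch(urls, stored_urls):
        stored = set(stored_urls)
        valid = []
        for url in urls:
            if "?" in url:                                  # query symbols
                continue
            if url.lower().endswith(UNDESIRED_EXTENSIONS):  # file extensions
                continue
            if UNDESIRED_PATTERNS.search(url):              # undesired patterns
                continue
            if url in stored:                               # already in the database
                continue
            valid.append(url)
        return valid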

find_urls_data

This function processes a list of news URLs, extracts the necessary data from each URL, and stores it in a structured format. It also identifies URLs that couldn't be processed.
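The overall shape is a loop that separates successes from failures, roughly like this sketch, where extract_one is a hypothetical stand-in for the per-URL extraction step:

    def find_urls_data_sketch(urls, extract_one):
        """Apply extract_one to each URL; collect results and failures separately."""
        results, failed = [], []
        for url in urls:
            try:
                results.append(extract_one(url))
            except Exception:
                failed.append(url)  # keep going; report unprocessable URLs later
        return results, failed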

run_multi_process

This function orchestrates the multi-process execution of the web scraping script. It divides the workload among multiple processes, each handling a specific set of media URLs.
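A minimal sketch of that division of labor with the standard library, where worker is a hypothetical top-level function that handles one chunk of media URLs end to end:

    from multiprocessing import Pool

    def run_multi_process_sketch(media_urls, worker, processes=4):
        """Split the media list into chunks and process them in parallel."""
        # Striding keeps chunk sizes balanced; worker must be picklable
        # (a module-level function), as multiprocessing requires.
        chunks = [media_urls[i::processes] for i in range(processes)]
        with Pool(processes=processes) as pool:
            return pool.map(worker, chunks)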

main_multi_threading_process

This function represents the main logic for processing news URLs in a multi-threaded environment. It iterates through media URLs, retrieves the HTML content of news sections, and processes the news URLs.
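Since section downloads are network-bound, threads are a natural fit. A minimal sketch, assuming requests for HTTP; the repository's actual client and error handling may differ:

    from concurrent.futures import ThreadPoolExecutor
    import requests

    def fetch_section_html(section_url: str) -> str:
        response = requests.get(section_url, timeout=10)
        response.raise_for_status()
        return response.text

    def main_multi_threading_process_sketch(section_urls, max_workers=8):
        """Download section pages concurrently so network waits overlap."""
        with ThreadPoolExecutor(max_workers=max_workers) as executor:
            return list(executor.map(fetch_section_html, section_urls))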

filter_old_url

This function checks the publication date of a news article to determine if it's too old. If the article is older than a specified threshold, it is filtered out.
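A minimal sketch of the age check; the two-day threshold and the naive ISO-8601 date format are assumptions, not the repository's actual values:

    from datetime import datetime, timedelta
    from typing import Optional

    MAX_AGE = timedelta(days=2)  # illustrative threshold

    def filter_old_url_sketch(published_iso: str, now: Optional[datetime] = None) -> bool:
        """Return True when the article is recent enough to keep."""
        now = now or datetime.now()
        published_at = datetime.fromisoformat(published_iso)  # naive ISO date assumed
        return now - published_at <= MAX_AGE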

order_dict_keys

This function arranges the keys in a dictionary in a specified order, making the resulting dictionary more structured and easier to work with.
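Because Python dictionaries preserve insertion order, reordering reduces to rebuilding the dictionary, as in this sketch:

    def order_dict_keys_sketch(data: dict, key_order: list) -> dict:
        """Rebuild data so its keys follow key_order; missing keys map to None."""
        return {key: data.get(key) for key in key_order}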

read_stored_news and create_news_table

These functions interact with a database to store and retrieve news data. read_stored_news retrieves URLs that are already stored, and create_news_table initializes the database table if it doesn't exist.
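A minimal sketch with sqlite3; the table name and schema are illustrative, not the repository's actual ones:

    import sqlite3

    def create_news_table_sketch(conn: sqlite3.Connection) -> None:
        """Create the table on first run; a no-op afterwards."""
        conn.execute(
            """CREATE TABLE IF NOT EXISTS news (
                   url TEXT PRIMARY KEY,
                   title TEXT,
                   body TEXT,
                   creation_date TEXT
               )"""
        )
        conn.commit()

    def read_stored_news_sketch(conn: sqlite3.Connection) -> set:
        """Return the set of URLs already stored, for fast membership tests."""
        return {row[0] for row in conn.execute("SELECT url FROM news")}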

insert_news

This function inserts news data into the database table, making it available for later retrieval and analysis.
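A sketch reusing the illustrative schema from the previous block; INSERT OR IGNORE skips URLs that are already stored:

    def insert_news_sketch(conn, rows):
        """Bulk-insert articles; rows is an iterable of
        (url, title, body, creation_date) tuples."""
        conn.executemany(
            "INSERT OR IGNORE INTO news (url, title, body, creation_date) "
            "VALUES (?, ?, ?, ?)",
            rows,
        )
        conn.commit()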
