pyBrNews Project, made with ❤️ by Lucas Rodrigues (@NepZR).
The pyBrNews project is a library, under development, capable of acquiring news and comment data from Brazilian news platforms. It is written entirely in Python and uses the requests-HTML library at its core.
The library is also available for download and installation via pip, from PyPI!
- Using the Python package manager (pip), from PyPI:

```shell
pip install pyBrNews
```

- Using the Python package manager (pip), from source (GitHub):

```shell
pip install git+https://github.com/NepZR/pyBrNews.git
```

- Building the wheel file and installing directly from source (GitHub):

```shell
git clone https://github.com/NepZR/pyBrNews.git && cd pyBrNews/
python setup.py bdist_wheel
pip install dist/pyBrNews-x.x.x-py3-none-any.whl --force-reinstall
```

Note: replace x.x.x with the corresponding version.
| Site name          | News                | Comments            | URL  |
|--------------------|---------------------|---------------------|------|
| Portal G1          | ✅ Functional       | ⌨️ In development   | Link |
| Folha de São Paulo | ✅ Functional       | ✅ Functional       | Link |
| Exame              | ✅ Functional       |                     | Link |
| Metrópoles         | ⌨️ In development   | ⌨️ In development   | Link |
Database: uses MongoDB (via pyMongo), supported since October 28, 2022. Local file system storage (JSON / CSV) is also supported, since October 30, 2022.

Responsible modules: `pyBrNews.config.database.PyBrNewsDB` and `pyBrNews.config.database.PyBrNewsFS`

Additional information: to use the local file storage system (JSON / CSV), set the parameter `use_database=False` in the crawlers of the `news` package. Example: `crawler = pyBrNews.news.g1.G1News(use_database=False)`. By default, it is set to `True` and uses the MongoDB database through the `PyBrNewsDB` class.
```python
def parse_news(self,
               news_urls: List[Union[str, dict]],
               parse_body: bool = False,
               save_html: bool = True) -> Iterable[dict]:
    """
    Extracts all the data from the article in a given news platform by iterating over a URL list. Yields a
    dictionary containing all the parsed data from the article.

    Parameters:
        news_urls (List[Union[str, dict]]): A list containing all the URLs or data dicts to be parsed from a
                                            given platform.
        parse_body (bool): Defines if the article body will be extracted.
        save_html (bool): Defines if the HTML bytes from the article will be extracted.

    Returns:
        Iterable[dict]: Generator yielding a dictionary with the parsed data of each article.
    """
```
```python
def search_news(self,
                keywords: List[str],
                max_pages: int = -1) -> List[Union[str, dict]]:
    """
    Extracts all the data or URLs from the news platform based on the given keywords. Returns a list containing
    the URLs / data found for the keywords.

    Parameters:
        keywords (List[str]): A list containing all the keywords to be searched in the news platform.
        max_pages (int): Number of pages to have the article URLs extracted from.
                         If not set, extracts from every available page.

    Returns:
        List[Union[str, dict]]: List containing all the URLs / data found for the keywords.
    """
```
- Class `PyBrNewsDB`
```python
def set_connection(self, host: str = "localhost", port: int = 27017) -> None:
    """
    Sets the connection host:port parameters for the MongoDB. By default, uses the standard localhost:27017 for
    local usage.

    Parameters:
        host (str): Hostname or address to connect.
        port (int): Port to be used in the connection.
    """
```
```python
def insert_data(self, parsed_data: dict) -> None:
    """
    Inserts the parsed data from a news article or extracted comment into the DB backend (MongoDB - pyMongo).

    Parameters:
        parsed_data (dict): Dictionary containing the parsed data from a news article or comment.

    Returns:
        None: Shows a success message if the insertion occurred normally. If not, shows an error message.
    """
```
```python
def check_duplicates(self, parsed_data: dict) -> bool:
    """
    Checks if the parsed data is already in the database, preventing it from being duplicated
    during the crawler execution.

    Parameters:
        parsed_data (dict): Dictionary containing the parsed data from a news article or comment.

    Returns:
        bool: True if the given parsed data is already in the database. False if not.
    """
```
- Class `PyBrNewsFS`
```python
def set_save_path(self, fs_save_path: str) -> None:
    """
    Sets the save path for all the exported data generated by this class.
    Example: set_save_path(fs_save_path="/home/ubuntu/newsData/")

    Parameters:
        fs_save_path (str): Desired save path directory, ending with a slash.
    """
```
```python
def to_json(self, parsed_data: dict) -> None:
    """
    Using the parsed data dictionary from a news article or a comment, exports the data as an individual JSON
    file.

    Parameters:
        parsed_data (dict): Dictionary containing the parsed data from a news article or a comment.
    """
```
```python
def export_all_data(self, full_data: List[dict]) -> None:
    """
    Given a list of dictionaries containing the parsed data from news or comments, exports a single CSV file
    containing all the data.

    Parameters:
        full_data (List[dict]): List containing the dictionaries of parsed data.
    """
```
Lucas Darlindo Freitas Rodrigues | Data Engineer | Python Backend Developer | LinkedIn (lucasdfr)