pyBrNews Project, made with ❤️ by Lucas Rodrigues (@NepZR).
The pyBrNews project is a library, under development, capable of acquiring news and comment data from Brazilian news platforms. It is written entirely in Python and uses the requests-HTML library at its core.
The library is also available for download and installation via pip, from PyPI!
- Using the Python package manager (pip), from PyPI:

```shell
pip install pyBrNews
```

- Using the Python package manager (pip), from source (GitHub):

```shell
pip install git+https://github.com/NepZR/pyBrNews.git
```

- Building the wheel file and installing directly from source (GitHub):

```shell
git clone https://github.com/NepZR/pyBrNews.git && cd pyBrNews/
python setup.py bdist_wheel
pip install dist/pyBrNews-x.x.x-py3-none-any.whl --force-reinstall
```

Note: replace x.x.x with the corresponding version.
| Site name          | News                | Comments            | URL  |
|--------------------|---------------------|---------------------|------|
| Portal G1          | ✅ Functional       | ⌨️ In development   | Link |
| Folha de São Paulo | ✅ Functional       | ✅ Functional       | Link |
| Exame              | ✅ Functional       |                     | Link |
| Metrópoles         | ⌨️ In development   | ⌨️ In development   | Link |
Database: uses MongoDB (via pyMongo), supported since October 28, 2022. Local file system storage (JSON / CSV) is also supported, since October 30, 2022.

Responsible modules: `pyBrNews.config.database.PyBrNewsDB` and `pyBrNews.config.database.PyBrNewsFS`

Additional information: to use the local file storage system (JSON / CSV), set the parameter `use_database=False` in the crawlers of the `news` package. Example: `crawler = pyBrNews.news.g1.G1News(use_database=False)`. By default, it is set to `True` and uses the MongoDB database through the `PyBrNewsDB` class.
```python
def parse_news(self,
               news_urls: List[Union[str, dict]],
               parse_body: bool = False,
               save_html: bool = True) -> Iterable[dict]:
    """
    Extracts all the data from the article in a given news platform by iterating over a URL list. Yields a
    dictionary containing all the parsed data from the article.

    Parameters:
        news_urls (List[Union[str, dict]]): A list containing all the URLs or data dicts to be parsed from a
                                            given platform.
        parse_body (bool): Defines if the article body will be extracted.
        save_html (bool): Defines if the HTML bytes from the article will be extracted.

    Returns:
        Iterable[dict]: Generator yielding a dictionary with the parsed data of each article.
    """
```
```python
def search_news(self,
                keywords: List[str],
                max_pages: int = -1) -> List[Union[str, dict]]:
    """
    Extracts all the data or URLs from the news platform based on the given keywords. Returns a list containing
    the URLs / data found for the keywords.

    Parameters:
        keywords (List[str]): A list containing all the keywords to be searched in the news platform.
        max_pages (int): Number of pages to have the article URLs extracted from.
                         If not set, extracts from every available page.

    Returns:
        List[Union[str, dict]]: List containing all the URLs / data found for the keywords.
    """
```
- Class `PyBrNewsDB`
```python
def set_connection(self, host: str = "localhost", port: int = 27017) -> None:
    """
    Sets the connection host:port parameters for the MongoDB. By default, uses the standard localhost:27017 for
    local usage.

    Parameters:
        host (str): Hostname or address to connect.
        port (int): Port to be used in the connection.
    """
```
```python
def insert_data(self, parsed_data: dict) -> None:
    """
    Inserts the parsed data from a news article or extracted comment into the DB backend (MongoDB - pyMongo).

    Parameters:
        parsed_data (dict): Dictionary containing the parsed data from a news article or comment.

    Returns:
        None: Shows a success message if the insertion occurred normally. If not, shows an error message.
    """
```
```python
def check_duplicates(self, parsed_data: dict) -> bool:
    """
    Checks if the parsed data is already in the database, preventing it from being duplicated
    during the crawler execution.

    Parameters:
        parsed_data (dict): Dictionary containing the parsed data from a news article or comment.

    Returns:
        bool: True if the given parsed data is already in the database. False if not.
    """
```
- Class `PyBrNewsFS`
```python
def set_save_path(self, fs_save_path: str) -> None:
    """
    Sets the save path for all the exported data generated by this class.
    Example: set_save_path(fs_save_path="/home/ubuntu/newsData/")

    Parameters:
        fs_save_path (str): Desired save path directory, ending with a slash.
    """
```
```python
def to_json(self, parsed_data: dict) -> None:
    """
    Using the parsed data dictionary from a news article or a comment, exports the data as an individual JSON
    file.

    Parameters:
        parsed_data (dict): Dictionary containing the parsed data from a news article or a comment.
    """
```
```python
def export_all_data(self, full_data: List[dict]) -> None:
    """
    Given a list of dictionaries containing the parsed data from news or comments, exports a single CSV file
    containing all the data.

    Parameters:
        full_data (List[dict]): List containing the dictionaries of parsed data.
    """
```
Lucas Darlindo Freitas Rodrigues | Data Engineer | Python Backend Developer | LinkedIn (lucasdfr)