antiSMI Project

Tech stack: Scikit-learn, Pandas, NumPy, PostgreSQL, SQLAlchemy, SQL, FastAPI, Pydantic, Streamlit, Docker, aiogram, Superset, aiohttp, Scrapy, BS4, Requests


About project

The AntiSMI project explores and creates new ways for readers, journalists and researchers to work with history through news.

It is a personal, non-profit analytics project at the intersection of ML and journalism. It uses machine learning models to analyze changes in the news flow in real time and aims to create a fundamentally different way of consuming news at a time when the way news is produced is changing and misinformation is widespread.

The project is currently based on Russian-language news, with plans to cover news in all key world languages.

You can use the project's applications and features right now:

  • Web-app (various tools to research the news flow)
  • Nowadays Bot (tools for working with current news)
  • Timemachine Bot (tools for working with past news, temporarily out of service)

Stats

  • Project start: 2022-07-01
  • Capacity: 40 news agencies, ~1,000 news items per day
  • News categories: 7
  • Present database capacity: ~400,000 news articles [01.2022 - today]
  • Archive database capacity: ~1,650,000 articles [08.1999 - 01.2022]

Structure

The project consists of independent parts, kept in separate repositories, that deal with news from the past and/or the present:

From a technical point of view, these parts can be divided into 5 groups (see scheme):

  1. Scrapers [Collector and Parsers] - collect and process agency news on a regular basis for use by the rest of the project
  2. Databases - relational and vector databases that store the news collected and processed by the Scrapers
  3. Backend - a FastAPI backend that retrieves various views of the news articles stored in the project databases and delivers these views to the frontends of the applications developed within the project (a minimal, hypothetical endpoint sketch follows this list)
  4. Frontend [Web-app, Nowadays Bot and Timemachine Bot] - different user interfaces for interacting with the project. The Web-app is the most versatile and comprehensive way; the bots serve as a mobile way to interact with the current and past news streams.
  5. Observer [Superset Visualizer] - researches social trends, makes dashboards and creates NLP models. It is an Apache Superset based analytics system that connects to the Databases and builds analytical dashboards and reports.
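
To illustrate the Backend-to-Frontend hand-off, here is a minimal sketch of such a "view" endpoint in the spirit of the project's FastAPI + Pydantic stack. The route name, model fields and hard-coded sample item are illustrative assumptions, not the project's actual API.

    # Hypothetical sketch of a backend "view" endpoint (FastAPI + Pydantic);
    # the route, fields and in-memory data are assumptions for illustration.
    from datetime import datetime

    from fastapi import FastAPI
    from pydantic import BaseModel

    app = FastAPI()


    class NewsView(BaseModel):
        title: str
        summary: str
        category: str
        published_at: datetime


    @app.get("/news/latest", response_model=list[NewsView])
    def latest_news() -> list[NewsView]:
        # In the real project this would query the news database;
        # a single hard-coded item stands in for that query here.
        return [
            NewsView(
                title="Example headline",
                summary="One-sentence machine summary of the article.",
                category="economy",
                published_at=datetime(2024, 1, 1, 12, 0),
            )
        ]

An app like this can be served with uvicorn and queried by the Web-app or the bots.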

For more details, you can access the repositories you are interested in by following the links below:

Databases and Observer are closed parts of the project. This means that you will not be able to reproduce the project's data using the source repositories and docker / docker-compose files, but you will still be able to learn and easily understand how to build a similar service yourself.

Stack

  • Language: Python, SQL
  • Databases: PostgreSQL + pgvector, SQLAlchemy (see the schema sketch after this list)
  • Validation: Pydantic
  • Logging: Loguru
  • BI: Apache Superset
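
As a rough illustration of how this stack fits together, the sketch below defines a hypothetical news table with a pgvector embedding column (SQLAlchemy 2.0 style) and a matching Pydantic schema. The table name, columns and 300-dimension vector size are assumptions, not the project's actual schema.

    # Hypothetical schema sketch for the PostgreSQL + pgvector / SQLAlchemy /
    # Pydantic stack; table name, columns and vector size are assumptions.
    from datetime import datetime

    from pgvector.sqlalchemy import Vector
    from pydantic import BaseModel
    from sqlalchemy import DateTime, Integer, String, Text
    from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column


    class Base(DeclarativeBase):
        pass


    class NewsArticle(Base):
        __tablename__ = "news"

        id: Mapped[int] = mapped_column(Integer, primary_key=True)
        url: Mapped[str] = mapped_column(String(512))
        title: Mapped[str] = mapped_column(String(512))
        body: Mapped[str] = mapped_column(Text)
        category: Mapped[str] = mapped_column(String(32))
        published_at: Mapped[datetime] = mapped_column(DateTime)
        embedding = mapped_column(Vector(300))  # assumed 300-d news embedding


    class NewsArticleSchema(BaseModel):
        # Pydantic model used to validate article data moving through the API
        url: str
        title: str
        body: str
        category: str
        published_at: datetime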

Self deploy

You can build and run the main parts of the project (Database, Backend and Web-app) using the docker-compose.yml file located in the root of this repository (you must have access to the files mentioned in steps 2-3).

  1. Clone the two repositories side by side into the root of your build directory on the server using git clone https://github.com/data-silence/antiSMI-backend and git clone https://github.com/data-silence/antiSMI-app
  2. Create a db directory and copy docker-compose.yml into the root of the build directory
  3. Copy the file with the required environment variables for each part of the project (.env-non-dev) into the root of that part's directory. Create a models directory and copy the categorization model file cat_model.ftz into it
  4. Make sure that docker is installed on the server.
  5. Build and start the project using docker compose up -d
  6. Your database starts on port 5432, your API on port 8000, and your web application on port 8501 (a quick smoke-test sketch follows these steps)
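
After docker compose up -d finishes, you can run a quick smoke test to confirm that all three services are listening. This is a minimal sketch: the ports come from step 6, while the localhost host name and the timeout are assumptions for a local deployment.

    # Hypothetical smoke test: checks that the three containers accept TCP
    # connections on the ports listed in step 6 of the deploy guide.
    import socket

    SERVICES = {
        "PostgreSQL": ("localhost", 5432),
        "FastAPI backend": ("localhost", 8000),
        "Streamlit web-app": ("localhost", 8501),
    }

    for name, (host, port) in SERVICES.items():
        try:
            with socket.create_connection((host, port), timeout=3):
                print(f"{name} is accepting connections on {host}:{port}")
        except OSError as exc:
            print(f"{name} is NOT reachable on {host}:{port}: {exc}")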

Pipeline

  • Scraping
    • requests
    • beautifulsoup4
  • Summarization
    • mBART, Seq2Seq, pre-trained [news summary]
    • ruT5, pre-trained [headline]
  • Categorization
    • fastText, supervised training, 7 classes (categories)
  • Clustering
    • Navec GloVe embeddings (trained on a news corpus)
    • sklearn: agglomerative clustering by cosine distance with tuned thresholding (see the sketch after this list)
  • Interaction Interface
    • pyTelegramBot [user interface]
    • Superset [analytics, dashboards]
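
The clustering step groups articles that describe the same story. Below is a minimal sketch of that idea using scikit-learn's agglomerative clustering over cosine distance. The embeddings here are random stand-ins for the Navec news-corpus GloVe vectors, and the 0.3 distance threshold is an assumed value rather than the project's tuned one.

    # Sketch of agglomerative clustering of news embeddings by cosine distance.
    # Random vectors stand in for Navec embeddings; the threshold is assumed.
    import numpy as np
    from sklearn.cluster import AgglomerativeClustering

    rng = np.random.default_rng(42)
    embeddings = rng.normal(size=(12, 300))  # 12 news items, 300-d vectors

    clusterer = AgglomerativeClustering(
        n_clusters=None,            # let the distance threshold decide
        metric="cosine",            # `affinity="cosine"` on scikit-learn < 1.2
        linkage="average",          # ward linkage does not support cosine
        distance_threshold=0.3,     # assumed; the project tunes this value
    )
    labels = clusterer.fit_predict(embeddings)
    print(labels)                   # items sharing a label form one news story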

Development Tools

  • PyCharm
  • Docker
  • GitHub
  • Linux shell

Plans

  • Common purpose
    • replicate the project to cover the news agenda in other countries
  • Collector
    • increase source coverage: add parsing of English-language, Ukrainian and pro-state news agencies
  • Frontend
    • audio digests
    • train a neural network model to generate news images
  • Observer
    • deploy a remote Superset server
    • increase dashboard coverage of news streams and agency structure
    • write and publish an article based on the results of research and dashboards

Contact info

📬 enjoy-ds@pm.me