The AntiSMI project explores and creates new ways of working with history through news, for readers, journalists and researchers.
It is a personal, non-profit analytical project at the intersection of ML and journalism. It uses machine-learning models to analyze changes in the news flow in real time, aiming to create a fundamentally different way of consuming news at a time when the way news is produced is changing and misinformation is widespread.
The project is currently based on Russian-language news, with plans to cover news in all key world languages.
You can use this project's applications and tools right now:
- Web-app (various tools to research the news flow)
- Nowadays Bot (tools for working with current news)
- Timemachine Bot (tools for working with past news, temporarily out of service)
- Project start: 2022-07-01
- Capacity: 40 news agencies, ~1,000 news items/day
- News categories: 7
- Current database size: ~400,000 news articles [01.2022 - today]
- Archive database size: ~1,650,000 articles [08.1999 - 01.2022]
The project consists of independent parts that deal with news from the past and/or the present and live in different repositories:
From a technical point of view, these parts fall into 5 groups (see the scheme):
- Scrappers [Collector and Parsers] - collect and process agency news on a regular basis for use by the rest of the project
- Databases - relational and vector databases that store news collected and processed by Scrappers
- Backend [FastAPI] - retrieves various views of the news articles stored in the project databases and serves these views to the frontends of the applications developed within the project
- Frontend [Web-app, Nowadays Bot and Timemachine Bot] - the different user interfaces for interacting with the project. The Web-app is the most versatile and comprehensive way; the bots serve as a mobile way to interact with the current and past news stream.
- Observer [Superset Visualizer] - researches social trends, builds dashboards and creates NLP models. It is an Apache Superset-based analytics system that connects to the Databases and builds analytical dashboards and reports.
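To make the division of responsibilities concrete, the path of one news item through these groups can be sketched with plain Python dataclasses. Every field and function name below is an illustrative assumption made for this sketch, not the project's actual schema:

```python
from dataclasses import dataclass
from datetime import datetime

# Hypothetical, simplified records: the real schema lives in the closed Databases part.

@dataclass
class RawNews:            # produced by the Scrappers
    agency: str
    url: str
    text: str
    fetched_at: datetime

@dataclass
class ProcessedNews:      # stored in the Databases, served by the Backend to the Frontend
    agency: str
    url: str
    summary: str          # in the real pipeline: mBart summary
    headline: str         # in the real pipeline: ruT5 headline
    category: str         # in the real pipeline: one of 7 fasttext categories
    fetched_at: datetime

def process(raw: RawNews) -> ProcessedNews:
    """Stand-in for the real summarization/categorization pipeline."""
    return ProcessedNews(
        agency=raw.agency,
        url=raw.url,
        summary=raw.text[:200],           # placeholder for the mBart summary
        headline=raw.text.split(".")[0],  # placeholder for the ruT5 headline
        category="society",               # placeholder for the fasttext label
        fetched_at=raw.fetched_at,
    )
```

The Observer then reads the same stored records to build its dashboards, without touching the scraping or serving code.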
For more details, you can explore the repositories you are interested in via the links below:
Databases and Observer are closed parts of the project. This means you will not be able to reproduce the project's data from the source repositories and docker / docker-compose files alone, but you can still study them and easily understand how to build a similar service yourself.
- Language: Python, SQL
- Databases: PostgreSQL + pgvector, SQLAlchemy
- Validation: Pydantic
- Logging: Loguru
- BI: Apache Superset
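As an illustration of how the validation layer might be used, here is a minimal Pydantic model for a news record. The field names are assumptions made for this sketch, not the project's real schema:

```python
from datetime import date
from pydantic import BaseModel, ValidationError

class NewsItem(BaseModel):
    # Illustrative fields only; the real schema lives in the closed Databases part.
    agency: str
    title: str
    category: str
    published: date

# Well-formed input passes validation...
item = NewsItem(agency="example-agency", title="Some headline",
                category="economy", published=date(2023, 5, 1))

# ...while a malformed date is rejected before it can reach the database.
try:
    NewsItem(agency="example-agency", title="Some headline",
             category="economy", published="not-a-date")
except ValidationError:
    print("rejected malformed record")
```

The same models can double as FastAPI response schemas, which is one reason the Pydantic + FastAPI combination is common.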
The main parts of the project (Database, Backend and Web-app) can be composed using the docker-compose.yml file located in the root of this repository (you must have access to the files mentioned in steps 2-3).
- Clone the two repositories side by side into the root of the build directory:
  `git clone https://github.com/data-silence/antiSMI-backend`
  `git clone https://github.com/data-silence/antiSMI-app`
- Create a `db` directory and copy `docker-compose.yml` into the root of the build directory.
- Copy the file with the required environment variables for each part of the project, `.env-non-dev`, into the root of that part's directory.
- Create a `models` directory and copy the categorisation model file `cat_model.ftz` into it.
- Make sure that Docker is installed on the server.
- Start building the project: `docker compose up -d`
- Your database starts on port 5432, your API on port 8000, and your web application on port 8501.
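For orientation only, a compose layout matching the ports listed above might look roughly like this. The service names, images and build paths here are assumptions; the real docker-compose.yml ships with the repository:

```yaml
# Hypothetical sketch only -- use the docker-compose.yml from the repository root.
services:
  db:
    image: postgres:15          # PostgreSQL (with pgvector in the real project)
    env_file: db/.env-non-dev
    ports:
      - "5432:5432"
  backend:
    build: ./antiSMI-backend    # FastAPI backend
    env_file: antiSMI-backend/.env-non-dev
    ports:
      - "8000:8000"
    depends_on: [db]
  app:
    build: ./antiSMI-app        # web application on 8501
    env_file: antiSMI-app/.env-non-dev
    ports:
      - "8501:8501"
    depends_on: [backend]
```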
- Scraping
- requests
- beautifulsoup4
- Summarization
- mBart, Seq2Seq, pre-trained [news summary]
- ruT5, pre-trained [headline]
- Categorization
- fasttext, supervised pre-training, 7 classes (categories)
- Clustering
- Navec GloVe embeddings (trained on a news corpus)
- sklearn: agglomerative clustering by cosine distance with tuned thresholding
- Interaction Interface
- pyTelegramBot [user interface]
- SuperSet [analytics, dashboards]
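The clustering step can be illustrated as follows. This sketch uses SciPy's hierarchical clustering with cosine distance as a stand-in for the project's sklearn agglomerative clustering, toy 2-d vectors instead of real Navec embeddings, and an invented 0.3 threshold rather than the project's tuned value:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy stand-ins for Navec news embeddings (the real vectors are GloVe vectors
# trained on a news corpus). Rows 0-1 are one story from two agencies,
# rows 2-3 are a different story from the same two agencies.
embeddings = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [0.1, 0.9],
])

# Average-linkage hierarchical clustering by cosine distance,
# cut at a distance threshold (0.3 is an illustrative value).
Z = linkage(embeddings, method="average", metric="cosine")
labels = fcluster(Z, t=0.3, criterion="distance")
# The two "story A" vectors land in one cluster, the two "story B" vectors in another.
```

Grouping near-duplicate agency reports this way is what lets the apps show one event instead of forty copies of it.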
- Pycharm
- Docker
- GitHub
- Linux shell
- Common goal
- replicate the project to cover the news agenda in other countries
- Collector
- increase source coverage: add parsing of English-language, Ukrainian and pro-state news agencies
- Frontend
- audio digests
- training a neural network model for generating news photos
- Observer
- deploy a remote Superset server
- increase dashboard coverage of news streams and agency structure
- write and publish an article based on the results of research and dashboards