
Flask-based web application for similarity search over news articles; the approach generalizes to any text corpus. Given a paragraph or a URL as input, it returns the most similar news articles from the database.


K0EKJE/LLM-Based-News-Recommendation


News Recommendation Web Application

An LLM-based news article recommendation web application. It accepts a URL or a block of text as input, performs an index search with Facebook AI Similarity Search (FAISS), and returns hyperlinks to the most similar articles in the database. If the input is a URL, the application automatically runs a scraper to scrape the article at that URL.
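The choice between the scraper path and the plain-text path can be sketched as below. This is a minimal illustration using only the standard library; the helper names `is_url` and `route_input` are hypothetical and are not taken from the repository's `routes.py`.

```python
from urllib.parse import urlparse


def is_url(user_input: str) -> bool:
    """Return True when the input looks like a single http(s) URL."""
    candidate = user_input.strip()
    # A URL has no internal whitespace; a pasted paragraph almost always does.
    if not candidate or any(ch.isspace() for ch in candidate):
        return False
    parsed = urlparse(candidate)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def route_input(user_input: str) -> str:
    """Pick a processing path: 'scrape' for URLs, 'text' for everything else."""
    return "scrape" if is_url(user_input) else "text"


print(route_input("https://example.com/news/story.html"))
print(route_input("Markets rallied on Tuesday after the announcement."))
```

A paragraph that merely mentions a URL is still treated as text, because the whole input must parse as one URL before the scraper path is chosen.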

Every article, including the input, is first summarized with a BART model by default (a T5 model can be used instead), then encoded by a SentenceTransformer model that maps sentences and paragraphs to a 384-dimensional dense vector space. The index search then compares the input vector against the database vectors and returns the indices of the top 5 matches. The final output is retrieved from our dataset using those indices.
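To make the search step concrete, here is a pure-Python stand-in for what an exact nearest-neighbor lookup computes. It is not the FAISS API and not the repository's `indexer.py`; the function name `top_k_indices` is hypothetical, and the use of squared L2 distance is an assumption, since the text above does not state which metric or index type is used.

```python
import heapq


def top_k_indices(query, database, k=5):
    """Return indices of the k database vectors closest to query (squared L2)."""
    def squared_l2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    distances = ((squared_l2(query, vec), idx) for idx, vec in enumerate(database))
    # nsmallest orders by (distance, index), so ties resolve to the lower index.
    return [idx for _, idx in heapq.nsmallest(k, distances)]


# Toy 2-dimensional vectors; the real embeddings are 384-dimensional.
database = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
print(top_k_indices([0.9, 0.1], database, k=2))  # prints [1, 0]
```

The returned positions play the same role as the indices described above: they are row numbers into the dataset, not the articles themselves.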

Our example dataset, which has more than 9,800 entries, is hosted on AWS S3 and is publicly available. The dataset is stored as partitions rather than as a single file, so only the partitions that contain the final indices need to be retrieved. We have also defined functions that automate dataset partitioning, uploading to S3, and reading from S3.
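One way to work out which partitions to fetch is sketched below. It assumes fixed-size, contiguous partitions numbered from 0, which the text above does not confirm; the function name `group_by_partition` and the partition size of 1000 are hypothetical and do not come from `util.py`.

```python
from collections import defaultdict


def group_by_partition(indices, partition_size):
    """Group global row indices as {partition id: [row offsets within it]}."""
    if partition_size <= 0:
        raise ValueError("partition_size must be positive")
    needed = defaultdict(list)
    for idx in indices:
        # Assumed layout: partition p holds rows [p * size, (p + 1) * size).
        part_id, offset = divmod(idx, partition_size)
        needed[part_id].append(offset)
    return dict(needed)


# Four result indices touch only three partitions, so only those are downloaded.
print(group_by_partition([12, 1503, 1999, 9700], partition_size=1000))
# prints {0: [12], 1: [503, 999], 9: [700]}
```

Under this layout, a top 5 query touches at most five partitions, no matter how large the full dataset grows.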

Project Structure:

News-Recommendation/
│
├── app/
│   ├── static/
│   │   └── styles.css            # CSS styles
│   ├── templates/
│   │   ├── index.html            # Main page template
│   │   ├── results.html          # Results display template
│   │   └── error.html            # Error display template
│   ├── __init__.py               # Initialize Flask app
│   └── routes.py                 # Flask routes
│
├── dataset/
│   ├── partitioned_embeddings/   # Example vector dataset partitions
│   ├── partitioned_nyt/          # Example dataset partitions
│   └── embeddings.npy            # Encoded vectors of dataset
│
├── news_articles/                # Scrapy project files
│   ├── news_articles/
│   │   ├── spiders/
│   │   │   ├── __init__.py
│   │   │   └── news_spider.py    # Spider class definition
│   │   ├── __init__.py
│   │   ├── items.py
│   │   ├── middlewares.py
│   │   ├── pipelines.py
│   │   ├── settings.py
│   │   └── spider_runner.py      # Spider runner definition
│   └── scrapy.cfg
│
├── utils/
│   ├── vectorizer.py             # Script for article vectorization
│   ├── indexer.py                # Script for creating and querying the index
│   └── util.py                   # Script for dataset partition and interaction with S3
│
├── requirements.txt              # Python dependencies
└── run.py                        # Entry point to run the Flask app
