News Recommendation Web Application

An LLM-based news article recommendation web application. It accepts either a URL or raw article text as input, runs an index search with Facebook AI Similarity Search (FAISS), and returns hyperlinks to the most similar articles in the database. If the input is a URL, the application automatically runs a scraper to scrape the article.
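
As a minimal sketch of the input handling described above, the snippet below decides whether an input should go to the scraper or be treated as article text. The function names `is_url` and `route_input` are hypothetical and are not taken from this repository; the actual check in `app/routes.py` may differ.

```python
from urllib.parse import urlparse


def is_url(user_input: str) -> bool:
    """Return True if the input looks like an absolute http(s) URL."""
    candidate = user_input.strip()
    # A URL contains no whitespace; article text almost always does.
    if not candidate or any(ch.isspace() for ch in candidate):
        return False
    try:
        parsed = urlparse(candidate)
    except ValueError:
        # urlparse can reject malformed inputs such as a broken IPv6 host.
        return False
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def route_input(user_input: str) -> str:
    """Hypothetical dispatcher: 'scrape' for URLs, 'text' for everything else."""
    return "scrape" if is_url(user_input) else "text"
```

Anything that is not a well-formed `http` or `https` address falls through to the text path, so a pasted paragraph is never sent to the scraper by mistake.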

Every article, including the input, is first summarized with a BART model by default (a T5 model can be used instead) and then encoded by a SentenceTransformer, which maps sentences and paragraphs to a 384-dimensional dense vector space. The index search then compares the input vector against the vectors in the database and returns the indices of the top 5 results. The final output is retrieved from our dataset using those indices.
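
To illustrate the search step, here is a small sketch of an exact top-k nearest-neighbor lookup. It is a pure-Python stand-in, not the FAISS API, and it assumes squared L2 distance; this README does not state which FAISS index type or metric the project uses. The toy vectors are 3-dimensional for readability, whereas the real embeddings are 384-dimensional. The names `squared_l2` and `top_k` are hypothetical.

```python
import heapq
from typing import List, Sequence, Tuple


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k(
    query: Sequence[float],
    database: Sequence[Sequence[float]],
    k: int = 5,
) -> List[Tuple[int, float]]:
    """Return (index, distance) pairs for the k vectors closest to the query.

    Brute-force scan over every vector; ties are broken by the lower index.
    """
    scored = ((squared_l2(query, vec), idx) for idx, vec in enumerate(database))
    best = heapq.nsmallest(k, scored)
    return [(idx, dist) for dist, idx in best]


# Toy data standing in for encoded article summaries.
database = [
    [0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 2.0, 0.0],
    [3.0, 3.0, 3.0],
    [0.1, 0.0, 0.0],
    [5.0, 5.0, 5.0],
]
query = [0.0, 0.0, 0.0]
nearest_indices = [idx for idx, _ in top_k(query, database, k=3)]
```

The returned indices are positions in the embedding matrix, which is why they can be used directly to look up the matching rows in the article dataset.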

Our example dataset, which has 9800+ entries, is currently hosted on AWS S3 and is publicly available. Rather than storing the dataset as a single file, we store it in partitions, so that only the partitions containing the final indices need to be retrieved. We have also defined functions that automate dataset partitioning, uploading to AWS S3, and reading from AWS S3.
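
The partition lookup can be sketched as follows, assuming the dataset is split into contiguous, fixed-size partitions. The partition size of 1000 and the function name `group_by_partition` are hypothetical; the actual partitioning logic lives in `utils/util.py` and may work differently.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def group_by_partition(
    indices: Iterable[int], partition_size: int
) -> Dict[int, List[int]]:
    """Map global row indices to {partition_id: [row offsets inside it]}."""
    if partition_size <= 0:
        raise ValueError("partition_size must be positive")
    groups: Dict[int, List[int]] = defaultdict(list)
    for global_index in indices:
        # Integer division picks the partition; the remainder is the local row.
        partition_id, local_row = divmod(global_index, partition_size)
        groups[partition_id].append(local_row)
    return dict(groups)


# Five result indices, as returned by the index search (values are made up).
needed = group_by_partition([12, 9731, 1005, 1999, 4500], partition_size=1000)
```

With these made-up indices, only partitions 0, 1, 4, and 9 would have to be downloaded, rather than all ten partitions of a 9800-row dataset.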

Project Structure:

Project Structure Tree:

News-Recommendation/
│
├── app/
│   ├── static/
│   │   └── styles.css            # CSS styles
│   ├── templates/
│   │   ├── index.html            # Main page template
│   │   ├── results.html          # Results display template
│   │   └── error.html            # Error display template
│   ├── __init__.py               # Initialize Flask app
│   └── routes.py                 # Flask routes
│
├── dataset/
│   ├── partitioned_embeddings/   # Example vector dataset partitions
│   ├── partitioned_nyt/          # Example dataset partitions
│   └── embeddings.npy            # Encoded vectors of dataset
│
├── news_articles/                # Scrapy project files
│   ├── news_articles/
│   │   ├── spiders/
│   │   │   ├── __init__.py
│   │   │   └── news_spider.py    # Spider class definition
│   │   ├── __init__.py
│   │   ├── items.py
│   │   ├── middlewares.py
│   │   ├── pipelines.py
│   │   ├── settings.py
│   │   └── spider_runner.py      # Spider runner definition
│   └── scrapy.cfg
│
├── utils/
│   ├── vectorizer.py             # Script for article vectorization
│   ├── indexer.py                # Script for creating and querying the index
│   └── util.py                   # Script for dataset partition and interaction with S3
│
├── requirements.txt              # Python dependencies
└── run.py                        # Entry point to run the Flask app

K0EKJE/LLM-Based-News-Recommendation
