mohammedjasam/CNN-Scrapper

CNN Scraper

This script scrapes CNN article pages, counts the word frequencies in each article, and builds a data matrix from those counts. Several similarity functions are then applied to the matrix to measure how similar the articles are.
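As a rough illustration of the data matrix idea, here is a minimal sketch that turns article texts into rows of word counts. It is not the code in scrapper.py: the function names `word_frequencies` and `build_matrix` are hypothetical, and a plain regex tokenizer stands in for whatever nltk and beautifulsoup4 processing the real script performs.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Count lowercase alphabetic tokens in one article's text.

    Assumption: a simple regex tokenizer; the real script may use nltk.
    """
    return Counter(re.findall(r"[a-z]+", text.lower()))


def build_matrix(articles):
    """Return (vocabulary, rows).

    rows[i][j] is how often vocabulary[j] occurs in articles[i].
    """
    counts = [word_frequencies(text) for text in articles]
    # The shared vocabulary is the sorted union of all words seen.
    vocabulary = sorted(set().union(*counts))
    rows = [[count[word] for word in vocabulary] for count in counts]
    return vocabulary, rows


if __name__ == "__main__":
    vocabulary, rows = build_matrix(["the cat sat", "the dog sat sat"])
    print(vocabulary)
    for row in rows:
        print(row)
```

Each row has one column per vocabulary word, so any two articles can be compared column by column.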

How to Run

Written for Python 3.6, but the last step also needs Python 2.7 :)

  • Run "python3 scrapper.py"
  • Use the Data.csv in the Similarity Analyzer folder
  • Run "python3 Parallel.py" (this step needs Python 2.7, so make sure both versions are installed)

Requirements:

  • "article_list" contains the list of article URLs, which can be obtained by running the crawler "article_url"
  • beautifulsoup4 (4.5.1)
  • lxml
  • nltk
  • SciPy

Output:

A data file called data.csv is saved. It contains the word frequencies for each article. Output files of Euclidean, Jaccard, and Cosine distances are also generated for analyzing the similarity of the articles.
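For reference, the three distances named above can be sketched in plain Python as follows. This is an illustrative sketch, not the code in Parallel.py: the function names are hypothetical, and the Jaccard variant shown treats each article as the set of words it contains, which is one common convention and may differ from what the project does. SciPy's `scipy.spatial.distance` module provides `euclidean`, `cosine`, and `jaccard` functions for the same purpose.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two word-count vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """1 minus the cosine of the angle between the two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    # Assumption: an all-zero vector is treated as maximally distant.
    return 1.0 - dot / norm if norm else 1.0


def jaccard_distance(a, b):
    """1 minus |shared words| / |all words|, ignoring the counts."""
    words_a = {i for i, x in enumerate(a) if x}
    words_b = {i for i, y in enumerate(b) if y}
    union = words_a | words_b
    if not union:
        return 0.0  # Assumption: two empty articles count as identical.
    return 1.0 - len(words_a & words_b) / len(union)


if __name__ == "__main__":
    first, second = [1, 0, 1, 1], [0, 1, 2, 1]
    print(euclidean_distance(first, second))
    print(cosine_distance(first, second))
    print(jaccard_distance(first, second))
```

A distance of 0 means the two rows are identical under that measure; larger values mean the articles are less alike.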

About

Collects articles from CNN.com and runs several similarity algorithms on them to find similarities between articles.
