kailashkarthik9/News-Crawler

News Crawler

News Crawler is the offline phase of the News Extraction and Summarization project.

Tech

News Crawler relies on the following open-source projects:

  • Crawler4j - an open source web crawler for Java
  • JSoup - a Java library for working with real-world HTML
  • Stanford CoreNLP - a set of natural language analysis tools

Installation

News Crawler requires the following JARs to run:

  • crawler4j-4.1-jar-with-dependencies.jar
  • slf4j-simple-1.6.1.jar
  • jsoup-1.10.2.jar
  • mysql-connector-java-5.1.40-bin.jar
  • All JARs in the Stanford CoreNLP suite
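
Before configuring the build path, it can help to confirm that the named JARs are actually present. The following is a minimal sketch, not part of the project: the class name `JarCheck` and the default folder `lib` are assumptions made for illustration. The Stanford CoreNLP JARs are not enumerated above, so the sketch does not check them.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper (not in the repository): reports which of the
// required JARs listed above are missing from a folder.
final class JarCheck {
    // File names copied verbatim from the list above.
    static final List<String> REQUIRED = Arrays.asList(
            "crawler4j-4.1-jar-with-dependencies.jar",
            "slf4j-simple-1.6.1.jar",
            "jsoup-1.10.2.jar",
            "mysql-connector-java-5.1.40-bin.jar");

    // Returns the required JAR names that are not in `present`,
    // in the same order as REQUIRED.
    static List<String> missing(Set<String> present) {
        List<String> out = new ArrayList<>();
        for (String jar : REQUIRED) {
            if (!present.contains(jar)) {
                out.add(jar);
            }
        }
        return out;
    }

    // Collects the names of all *.jar files directly inside `dir`.
    static Set<String> jarsIn(Path dir) throws IOException {
        try (Stream<Path> files = Files.list(dir)) {
            return files.map(p -> p.getFileName().toString())
                    .filter(name -> name.endsWith(".jar"))
                    .collect(Collectors.toSet());
        }
    }

    public static void main(String[] args) throws IOException {
        // "lib" is an assumed default; pass the real folder as the first argument.
        Path dir = Paths.get(args.length > 0 ? args[0] : "lib");
        List<String> missing = missing(jarsIn(dir));
        if (missing.isEmpty()) {
            System.out.println("All listed JARs found in " + dir);
        } else {
            System.out.println("Missing from " + dir + ": " + missing);
        }
    }
}
```

Run it with the folder that holds the downloaded JARs as the only argument. The check matches exact file names, so a different version of a library is reported as missing.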

Instructions

- Download the dependencies and import the project into Eclipse
- Right-click the project -> Build Path -> Configure Build Path -> Libraries -> Add External JARs
- Add the JARs to the classpath
- Create a database and relations according to the schema diagram
- Modify the default file locations used for temporary crawl data and the file repository
- Run CrawlController as a Java application
- Run AnaphoraAndTagging as a Java application
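
The step about file locations means editing paths in the source. One alternative is to resolve them from properties, so that each machine can override them without a code change. This is a sketch under assumptions and not how the project currently works: the class `CrawlPaths`, the property keys `newscrawler.storage` and `newscrawler.repository`, and the default folders under the system temp directory are all hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Properties;

// Hypothetical holder for the two locations mentioned in the instructions:
// the temporary crawl data folder and the file repository.
final class CrawlPaths {
    final Path crawlStorage;
    final Path fileRepository;

    CrawlPaths(Path crawlStorage, Path fileRepository) {
        this.crawlStorage = crawlStorage;
        this.fileRepository = fileRepository;
    }

    // Reads both locations from `props`, falling back to folders under the
    // system temp directory. The property keys are assumptions, not project settings.
    static CrawlPaths fromProperties(Properties props) {
        Path base = Paths.get(System.getProperty("java.io.tmpdir"), "news-crawler");
        String storage = props.getProperty(
                "newscrawler.storage", base.resolve("crawl").toString());
        String repository = props.getProperty(
                "newscrawler.repository", base.resolve("repository").toString());
        return new CrawlPaths(Paths.get(storage), Paths.get(repository));
    }

    // Creates both folders (and any missing parents) if they do not exist yet.
    CrawlPaths create() {
        try {
            Files.createDirectories(crawlStorage);
            Files.createDirectories(fileRepository);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return this;
    }

    public static void main(String[] args) {
        // System properties let the paths be set with -Dnewscrawler.storage=... at launch.
        CrawlPaths paths = fromProperties(System.getProperties()).create();
        System.out.println("Crawl storage:   " + paths.crawlStorage);
        System.out.println("File repository: " + paths.fileRepository);
    }
}
```

With crawler4j, the resolved `crawlStorage` path would typically be passed to `CrawlConfig.setCrawlStorageFolder(String)`. Whether the project's CrawlController is structured to accept it that way is an assumption.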

Authors

About

Final Year Project. News Extraction and Summarization
