News Crawler

News Crawler is the offline phase of the News Extraction and Summarization project

Tech

News Crawler uses a number of open source projects to work properly:

Crawler4j - an open source web crawler for Java
JSoup - a Java library for working with real-world HTML
Stanford CoreNLP - a set of natural language analysis tools

Installation

News Crawler requires the following JARs to run

crawler4j-4.1-jar-with-dependencies.jar
slf4j-simple-1.6.1.jar
jsoup-1.10.2.jar
mysql-connector-java-5.1.40-bin.jar
All JARs in Stanford CoreNLP Suite

Instructions

- Download the dependencies and import the project on eclipse
- Right click on project -> Build Path -> Configure Build Path -> Libraries -> Add External JAR
- Add the JARs to the class path
- Create a database and relations according to the schema diagram
- Modify the default file locations for storing temporary crawl data and file repository
- Run the CrawlController as a java application
- Run the AnaphoraAndTagging as a java application

Name		Name	Last commit message	Last commit date
Latest commit History 7 Commits
.settings		.settings
bin		bin
src		src
.classpath		.classpath
.project		.project
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

.settings

.settings

bin

bin

src

src

.classpath

.classpath

.project

.project

README.md

README.md

Repository files navigation

News Crawler

Tech

Installation

Instructions

Authors

About

Releases

Packages

Languages

kailashkarthik9/News-Crawler

Folders and files

Latest commit

History

Repository files navigation

News Crawler

Tech

Installation

Instructions

Authors

About

Topics

Resources

Stars

Watchers

Forks

Languages