Concept-Tagger

Web Crawler with concept tagging for crawled websites

Requirements:

python 2.7+
mongoDB

Dependencies:

jsoup
MongoDriver java
MongoDriver PHP
pymongo
urllib2

Setup:

1> Install mongodb on your machine:

Follow guide as per your OS:

http://docs.mongodb.org/manual/installation/
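
Once the server is installed and started, one quick way to confirm it is listening is to open a TCP connection to it. The sketch below assumes MongoDB's default port 27017 on localhost; `mongo_reachable` is a hypothetical helper for illustration and is not part of this project.

```python
import socket


def mongo_reachable(host="localhost", port=27017, timeout=2.0):
    """Return True if something accepts TCP connections on host:port.

    27017 is MongoDB's default port; change it if your install differs.
    This only shows that the port is open, not that the listener is MongoDB.
    """
    try:
        conn = socket.create_connection((host, port), timeout)
    except (socket.error, socket.timeout):
        # Connection refused, host unreachable, or timed out.
        return False
    conn.close()
    return True
```

Calling `mongo_reachable()` with no arguments checks the local default; a `False` result usually means the mongod service has not been started yet.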

2> Set up python dependencies:

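	sudo apt-get install python-pip

On Windows, download the pip package for Python:

https://pypi.python.org/pypi/pip

Then install the MongoDB driver for Python:

	pip install pymongo

urllib2 is part of the Python 2 standard library, so it does not need to be installed with pip.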
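
Before moving on, it can help to check that the Python modules actually import. The snippet below is a small sketch using only the standard library; `missing_modules` is a hypothetical helper name, not something shipped with this package.

```python
import importlib


def missing_modules(names):
    """Return the module names from `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing
```

For example, `missing_modules(["pymongo"])` returns an empty list once the driver is installed, and `["pymongo"]` otherwise.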

3> Download java dependencies

Download the jars listed under Dependencies, or use the ones included in the package's res folder.

4> Working:

i) If you are an Eclipse lover:
- import this structure as is into Eclipse
- import the jar files and add them to the build path
- press F11 to launch
- run conceptTagger.py

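ii) Terminal folks:
- enter the src folder
- compile with the libraries on the classpath:

    javac -classpath ../res/<the jar file> WebDoc.java

- run the Java class with the same jar on the classpath (use ; instead of : as the separator on Windows)

    java -classpath .:../res/<the jar file> WebDoc

  This will keep crawling and collect all domain names to populate your database. A multithreaded crawler is coming soon.
- enter the res folder and run the Python file

    python conceptTagger.py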
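
To make the "collect all domain names" step more concrete, here is a minimal Python sketch of pulling the linked host names out of one fetched page. It is an illustration only: the actual crawler is WebDoc.java, its logic is not reproduced here, and `LinkCollector` and `domains_in_page` are hypothetical names. Fetching the page and writing to MongoDB are left out.

```python
try:
    from html.parser import HTMLParser            # Python 3
    from urllib.parse import urljoin, urlparse
except ImportError:
    from HTMLParser import HTMLParser             # Python 2
    from urlparse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def domains_in_page(html, base_url):
    """Return the sorted, de-duplicated host names linked from a page."""
    collector = LinkCollector()
    collector.feed(html)
    domains = set()
    for link in collector.links:
        # Resolve relative links against the page URL before parsing.
        host = urlparse(urljoin(base_url, link)).hostname
        if host:  # skips links such as mailto: that have no host
            domains.add(host)
    return sorted(domains)
```

Each returned host name could then be queued for crawling and stored as a document in the database.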

5> Setting up the Front end:

This assumes you have an Apache or Nginx server configured with php-fpm or php-cgi. If not, please follow a guide for that; it won't be covered here.

i) Setting up the PHP-MongoDB driver:
- Refer to http://www.php.net/manual/en/mongo.installation.php
- Unix/Linux installation is pretty straightforward
- Windows:
    * Check the compiler version in phpinfo() and proceed with the appropriate dll
    * x86 thread_safe mongodb dlls for vc2009 and 2011 are included in /res

ii) The MongoClient() object:
- Refer to this simple tutorial on the php-mongodb interface: http://www.tutorialspoint.com/mongodb/mongodb_php.htm
- tagsearch.php is an API which is called with the parameter "q" for the query
- search_tag.html provides the user interface for the search

iii) Directory Structure:
- Copy the contents of the Front End folder "as is" onto your server
- The static folder contains all CSS, JS, additional libraries and code
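
Once the front end is deployed, the tagsearch.php API can be exercised from a script by passing the "q" parameter described above. The sketch below only builds the request URL; the base address `http://localhost/tagsearch.php` is an assumption about where you copied the files, and `tagsearch_url` is a hypothetical helper. The response format is not documented here, so no parsing is attempted.

```python
try:
    from urllib.parse import urlencode   # Python 3
except ImportError:
    from urllib import urlencode         # Python 2


def tagsearch_url(query, base="http://localhost/tagsearch.php"):
    """Build the URL for a tag search, URL-encoding the "q" parameter.

    `base` is an assumed location; point it at your own server.
    """
    return base + "?" + urlencode({"q": query})
```

The resulting URL can be opened in a browser or fetched with `urllib2.urlopen` on Python 2 to inspect the raw response.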

6> In case of doubts, leave a comment.

vishrutJha/Concept-Tagger
