Skip to content

vishrutJha/Concept-Tagger

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

11 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Concept-Tagger

Web Crawler with concept tagging for crawled websites

Requirements:

  • python 2.7+
  • mongoDB

Dependencies:

  • jsoup
  • MongoDriver java
  • MongoDriver PHP
  • pymongo
  • urllib2

Set UP:

1> Install mongodb on your machine:

Follow guide as per your OS:

http://docs.mongodb.org/manual/installation/

2> Set up python dependencies:

	sudo apt-get install python-pip

for windows download pip package for python

https://pypi.python.org/pypi/pip

pip install pymongo
pip install urllib2  # may not be required

3> Download java dependencies

download the ones stated before or use the ones included in the package\res

4> Working:

i) If you an eclipse lover:
- import this structure as is into eclipse
- import jar files and link'em up
- F11
- run the conceptTagger.py

ii) Terminal Fools:
- enter the src folder
- compile with libraries:
	
    javac -classpath ../res/<the jar file> WebDoc.java

- run the java file

    java WebDoc

    This will keep crawling and get all domain names to populate your database
    + coming soon a multithreaded crawler 
- enter /res, run the python file

    python conceptTagger.py

5> Setting up the Front end: I hope you have an Apache server/Nginx server configured with php-fpm or php-cgi - If not, please follow some guide for that, it won't be covered here

i) setting up the PHP-MongoDB driver: - Refer http://www.php.net/manual/en/mongo.installation.php - Unix/linux installation is pretty straightforward - Windows: * Check the compiler version on phpinfo() and proceed with the appropriate dll * included x86 thread_safe mongodb dlls for vc2009 and 2011 in /res

ii) the MongoClient() object: - Refer simple tutorial on php-mongodb interface: http://www.tutorialspoint.com/mongodb/mongodb_php.htm - The tagsearch.php is an API which is called with the parameter "q" for query - search_tag.html has the interface required for the stuff

iii) Directory Structure: - Copy the contents inside Front End folder "as is" into your server - the static folder contains all css, js, additional libraries and codes

6> In case off doubts, leave a comment

About

Web Crawler with concept tagging for crawled websites

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published