GitHub - helloworld0909/ForumCrawler: Crawl board page, post page and user page from ordinary forums like 1point3acres.com/bbs

ForumCrawler

ForumCrawler crawls study-abroad forums

The crawler collects board pages, posts and user info from ordinary Discuz-based forums and stores the data in a MySQL database. It can also crawl specific information, such as offer information, from study-abroad forums, and it can be modified to work on other regular forum websites.

To run the crawler, enter the following in a console:

python run.py <spider name>

Available spiders:

forum
gter

Dependencies:

Python 2.7
scrapy
bs4 (BeautifulSoup4)
MySQLdb
pywin32 (for Windows users)

Change Log

v0.5
Changes:

Finish offer_spider, which can crawl offer info from bbs.gter.net
Improve run.py to choose a different LOG_FILE and JOBDIR for each spider
Automatically ignore empty offer items
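
The per-spider LOG_FILE and JOBDIR choice could look roughly like the sketch below. It is not the code in run.py: the helper name `spider_overrides`, the `log/` and `jobs/` directory layout and the timestamp format are all assumptions made for illustration. Only `LOG_FILE` and `JOBDIR` are real Scrapy setting names.

```python
import os
import time


def spider_overrides(spider_name, base_dir="."):
    """Return per-spider Scrapy setting overrides (hypothetical helper)."""
    stamp = time.strftime("%Y%m%d_%H%M%S", time.localtime())
    return {
        # One log file per spider and per run, so logs do not interleave.
        "LOG_FILE": os.path.join(
            base_dir, "log", "%s_%s.log" % (spider_name, stamp)
        ),
        # One JOBDIR per spider, so a paused crawl resumes independently.
        "JOBDIR": os.path.join(base_dir, "jobs", spider_name),
    }


if __name__ == "__main__":
    import sys

    name = sys.argv[1] if len(sys.argv) > 1 else "forum"
    for key, value in sorted(spider_overrides(name).items()):
        print("%s=%s" % (key, value))
```

Each key/value pair could then be passed to Scrapy, for example as `-s KEY=VALUE` options on `scrapy crawl`.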

v0.41
Changes:

Divide settings into 2 parts:
1. General settings in /
2. Custom spider settings in /custom
Modify other components to fit this change
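
One way to picture the split between general and custom settings is a simple dictionary merge, sketched below. The names `GENERAL_SETTINGS`, `CUSTOM_SETTINGS` and `settings_for`, as well as every value shown, are illustrative assumptions and are not taken from the repository.

```python
# General settings shared by every spider (illustrative values).
GENERAL_SETTINGS = {
    "DOWNLOAD_DELAY": 1.0,
    "LOG_LEVEL": "INFO",
}

# Per-spider overrides (illustrative values).
CUSTOM_SETTINGS = {
    "forum": {},
    "gter": {"DOWNLOAD_DELAY": 2.0},
}


def settings_for(spider_name):
    """Merge the general settings with the spider's custom overrides."""
    merged = dict(GENERAL_SETTINGS)  # copy, so the shared dict is untouched
    merged.update(CUSTOM_SETTINGS.get(spider_name, {}))
    return merged


if __name__ == "__main__":
    print(settings_for("gter"))
```

Custom values win over general ones, and a spider without custom settings simply gets the general set.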

v0.4
Add some utils
Changes:

Add log_parser
Add cookies util
Developing gter.net spider

v0.31
Parse post context (admission info, user background, etc.)
Changes:

Parse post context()
Parse admission board correctly
Fix trivial bugs

v0.3
Finish User page parsing
Changes:

User page and profile parsing
from __future__ import unicode_literals
Fix names of attributes
Parse board_url and board_name of each post
Log filename is derived from time_local()

v0.23
Finish login
Changes:

Add class variable 'cookies', and pass it on to every request

v0.22
Finish forum parser and post parser
Changes:

Finish parse_post(), PostItem
Change the name of the project
MySQL tables use MyISAM engine

v0.21
Use Rule to crawl forum, add forum info into MySQL
Changes: (Only finish forum part)

Scrape forum info
Add separate rules with respect to forum, post and user
Add separate items
Manage the process of creating tables in settings.py (TABLE_INFO)
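
As a rough illustration of driving table creation from a TABLE_INFO setting, the sketch below turns a dictionary of column definitions into `CREATE TABLE` statements. The structure of `TABLE_INFO`, the column types and the helper `create_table_sql` are assumptions. The real layout in settings.py may differ. The MyISAM default follows the v0.22 entry.

```python
# Hypothetical layout: table name -> list of (column name, SQL type).
TABLE_INFO = {
    "forum": [
        ("board_url", "VARCHAR(255)"),
        ("board_name", "VARCHAR(255)"),
    ],
}


def create_table_sql(table, columns, engine="MyISAM"):
    """Build a CREATE TABLE statement from a list of column definitions."""
    cols = ", ".join("`%s` %s" % (name, sql_type) for name, sql_type in columns)
    return "CREATE TABLE IF NOT EXISTS `%s` (%s) ENGINE=%s" % (
        table, cols, engine
    )


if __name__ == "__main__":
    for table, columns in sorted(TABLE_INFO.items()):
        print(create_table_sql(table, columns))
```

The generated statements could be executed once at start-up through a MySQLdb cursor.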

v0.2
Only crawl URLs of board, thread and user pages
Changes:

Replace BeautifulSoup with XPath
Read cookies from json
Add Rules in ForumSpider
Add run.py
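
Reading cookies from JSON might be done along the lines of the sketch below. The file format is an assumption: the code accepts either a flat `{"name": "value"}` object or a browser-style export, that is, a list of objects with `name` and `value` keys. The function names are hypothetical.

```python
import json


def parse_cookies(text):
    """Turn a JSON string into a plain {name: value} cookie dict."""
    data = json.loads(text)
    if isinstance(data, dict):
        # Already a flat mapping of cookie names to values.
        return dict(data)
    # Otherwise assume a list of {"name": ..., "value": ...} objects.
    return dict((cookie["name"], cookie["value"]) for cookie in data)


def load_cookies(path):
    """Read a cookie file from disk and parse it."""
    with open(path) as handle:
        return parse_cookies(handle.read())


if __name__ == "__main__":
    print(parse_cookies('[{"name": "sid", "value": "abc"}]'))
```

A dictionary of this shape can be passed as the `cookies` argument of a `scrapy.Request`.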

v0.1
Crawl all links under the domain.

