Example: http://www.yinhang.com/licaichanpin_gRrgLT98.html

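I used to think the hard parts of web scraping were IP limits, request frequency, simulating mouse actions, and so on. I never expected parsing the HTML to be the problem, but this time I'm truly stumped.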

The digits and punctuation on the page are rendered from images, and the image is different on every request.

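This is the first time I've run into this. If it really can't be scraped, one more question: is there an open-source library like this that I could use for anti-scraping?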

`main.py` is an attempt at recognizing the characters in the images with tesseract. The results are poor (possibly because I'm using it wrong).
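For reference, here is a minimal sketch of what a tesseract attempt on a single cropped glyph could look like. It is not the contents of `main.py`: the function names, the threshold of 128, and the character whitelist are assumptions for illustration, and it presumes the `pytesseract` and Pillow packages plus a tesseract binary are installed.

```python
def build_config(whitelist="0123456789.,%", psm=10):
    """Build a tesseract option string (hypothetical helper).

    --psm 10 treats the image as a single character; the whitelist
    restricts output to the characters we expect on the page.
    """
    return f"--psm {psm} -c tessedit_char_whitelist={whitelist}"


def recognize_glyph(path, threshold=128):
    """OCR one glyph image after a simple fixed-threshold binarization."""
    # Imported lazily so build_config() works without the OCR stack installed.
    from PIL import Image
    import pytesseract

    img = Image.open(path).convert("L")  # grayscale
    img = img.point(lambda p: 255 if p > threshold else 0)  # black and white
    return pytesseract.image_to_string(img, config=build_config()).strip()
```

Small, low-resolution glyphs are a known weak spot for general-purpose OCR, which may be part of why the results here were poor. Upscaling the crop before recognition is a common thing to try.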

`_` was contributed by icedx@v2ex: it binarizes the image and then computes similarity, and it works well (recommended!).
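The idea of binarizing and then comparing by similarity can be sketched in plain Python as follows. This is only an illustration of the technique, not icedx's actual code: the tiny 3x3 templates, the threshold of 128, and the "fraction of matching pixels" score are all assumptions, and real glyphs would first be loaded and cropped from the page image.

```python
def binarize(gray, threshold=128):
    """Map a grid of 0-255 grayscale values to 1 (ink) or 0 (background)."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def similarity(a, b):
    """Fraction of pixel positions on which two equally sized binary grids agree."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("grids must have the same shape")
    total = sum(len(row) for row in a)
    same = sum(x == y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    return same / total


def classify(glyph, templates):
    """Return (label, score) for the template most similar to the glyph."""
    scored = ((label, similarity(glyph, t)) for label, t in templates.items())
    return max(scored, key=lambda pair: pair[1])


# Hypothetical toy templates; real ones would be labeled glyph crops.
TEMPLATES = {
    "1": [[0, 1, 0], [0, 1, 0], [0, 1, 0]],
    "7": [[1, 1, 1], [0, 0, 1], [0, 0, 1]],
}

# A noisy grayscale "1": one background pixel (90) is dark enough to count as ink.
noisy = [[250, 10, 240], [245, 20, 90], [255, 5, 250]]
label, score = classify(binarize(noisy), TEMPLATES)
print(label, round(score, 3))
```

Because the character set is small (digits and punctuation), a handful of labeled templates is enough, and the match tolerates the pixel-level variation between requests as long as the glyph shapes stay the same.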
