Fixed unicode handling, Python 3 support, Request as network backend, better content root extraction and other awesome features #248

Open
wants to merge 53 commits into
base: develop

@Lol4t0 Lol4t0 commented Nov 13, 2015

Xavier Grangier and others added 30 commits June 29, 2014 11:33
* STOP_WORDS are stored as unicode, so word candidates must also be kept as unicode to compare correctly against the dictionary.
* In some languages, short stop words are joined to the next word in the sentence with a non-breaking space. To detect those stop words, the tokenizer must support nbsp.
 Russian is an example. This fixes grangier#223
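
A minimal sketch of the tokenizing idea described above, assuming Python 3 text strings. The stop-word set, `tokenize`, and `count_stop_words` are hypothetical names for illustration and are not taken from the goose code base.

```python
# Hypothetical stop-word set, kept as unicode text like the real dictionary.
STOP_WORDS = {"в", "на", "и", "не"}

NBSP = "\u00a0"  # no-break space


def tokenize(text):
    """Split text into words, treating the no-break space as a separator."""
    return text.replace(NBSP, " ").split()


def count_stop_words(text, stop_words=STOP_WORDS):
    """Count tokens that appear in the (unicode) stop-word set."""
    return sum(1 for word in tokenize(text.lower()) if word in stop_words)


if __name__ == "__main__":
    # "в" is glued to "Москве" with a no-break space.
    sample = "в\u00a0Москве и на юге"
    print(tokenize(sample))
    print(count_stop_words(sample))
```

Without the nbsp handling, a tokenizer that splits only on the ASCII space would see `в\u00a0Москве` as one token and miss the stop word.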
HTML fetching is now done with requests

Using requests allows writing high-level code that encapsulates the network and HTML levels (decoding gzip, etc.)
1.0.28:

  * Move to requests as network library
Some special tags can be false positives, so we have to process them all to select the best top node
Requests prefers the content encoding declared in the HTTP headers, but for HTML the encoding declared in the document's tags is the better choice
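
The tags-before-headers preference can be sketched as below. This is an illustration only: `pick_encoding` and both regular expressions are assumptions, not the code from this PR, and the sketch works on raw bytes plus a `Content-Type` header string so it does not depend on any HTTP library.

```python
import re

# Charset declared in a <meta> tag, e.g. <meta charset="windows-1251"> or
# <meta http-equiv="Content-Type" content="text/html; charset=windows-1251">.
META_CHARSET_RE = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?\s*([A-Za-z0-9_\-]+)""", re.IGNORECASE
)
# Charset declared in a Content-Type header value.
HEADER_CHARSET_RE = re.compile(r'charset\s*=\s*"?([A-Za-z0-9_\-]+)', re.IGNORECASE)


def pick_encoding(raw_html, content_type_header=None, default="utf-8"):
    """Prefer the charset declared in the HTML tags, then the header, then a default."""
    match = META_CHARSET_RE.search(raw_html[:4096])  # only scan the document head
    if match:
        return match.group(1).decode("ascii").lower()
    if content_type_header:
        match = HEADER_CHARSET_RE.search(content_type_header)
        if match:
            return match.group(1).lower()
    return default


if __name__ == "__main__":
    raw = b'<meta charset="windows-1251"><p>' + "Привет".encode("cp1251") + b"</p>"
    encoding = pick_encoding(raw, "text/html; charset=ISO-8859-1")
    print(encoding, raw.decode(encoding))
```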
Lol4t0 and others added 21 commits January 14, 2016 16:05
Moving to requests as the HTTP library broke the test mocks, which relied on mocking urllib
This commit fixes the tests by using the mock_requests library for mocking instead of the urllib one.
It is not clear why it was there in the first place, as valid HTML does not contain such a header.

Again this is not connected to the test itself.
This brings automatic cookie handling, keep-alive connections, and possibly some other features
After moving to the requests HTTP backend, cookies are handled correctly.

Test URL http://www.nytimes.com/2013/08/18/world/middleeast/pressure-by-us-failed-to-sway-egypts-leaders.html?hp checked and working
Python 3.4, Python 3.5 added
* Requests is used for images. The same HTTP session is used for all requests.
* Analyze all possible text root nodes and select the best one; do not stop at the first text root node candidate
* Improve text selection filters
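
The "analyze all candidates, then pick the best" approach from the list above can be sketched like this. The scoring function (a plain stop-word count) and every name here are assumptions for illustration; the scoring actually used in the PR is not shown on this page.

```python
def score(text, stop_words):
    """Hypothetical score: number of stop words found in the candidate text."""
    return sum(1 for word in text.lower().split() if word in stop_words)


def best_candidate(candidates, stop_words):
    """Score every candidate and return the best one, or None if there are none.

    Unlike a first-match strategy, no candidate is accepted before all
    of them have been scored.
    """
    return max(candidates, key=lambda text: score(text, stop_words), default=None)


if __name__ == "__main__":
    stop_words = {"the", "on", "and", "it", "was"}
    candidates = [
        "Menu Home About",                          # navigation, appears first
        "the cat sat on the mat and it was happy",  # real article text
        "the end",
    ]
    print(best_candidate(candidates, stop_words))
```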
Config parameter is `known_context_patterns`
Default:

	{
		'known_context_patterns': [
		    {'attr': 'class', 'value': 'short-story'},
		    {'attr': 'itemprop', 'value': 'articleBody'},
		    {'attr': 'class', 'value': 'post-content'},
		    {'attr': 'class', 'value': 'g-content'},
		    {'tag': 'article'},
		]
	}
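
One way to read these patterns is as simple tag and attribute matchers. The sketch below is an interpretation, not the matching code from this PR: `matches` and `find_known_context` are hypothetical helpers, elements are modeled as plain `(tag, attrs)` pairs, and treating `class` as a space-separated token list is an assumption.

```python
KNOWN_CONTEXT_PATTERNS = [
    {'attr': 'class', 'value': 'short-story'},
    {'attr': 'itemprop', 'value': 'articleBody'},
    {'attr': 'class', 'value': 'post-content'},
    {'attr': 'class', 'value': 'g-content'},
    {'tag': 'article'},
]


def matches(pattern, tag, attrs):
    """Return True if an element (tag name + attribute dict) fits the pattern."""
    if 'tag' in pattern and pattern['tag'] != tag:
        return False
    if 'attr' in pattern:
        # Assumption: attribute values are compared token by token.
        tokens = attrs.get(pattern['attr'], '').split()
        if pattern['value'] not in tokens:
            return False
    return True


def find_known_context(elements, patterns=KNOWN_CONTEXT_PATTERNS):
    """Return every (tag, attrs) element that matches at least one pattern."""
    return [
        (tag, attrs) for tag, attrs in elements
        if any(matches(p, tag, attrs) for p in patterns)
    ]


if __name__ == "__main__":
    page = [
        ('div', {'class': 'sidebar'}),
        ('div', {'class': 'post-content wide'}),
        ('article', {}),
    ]
    print(find_known_context(page))
```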
When performing network requests, use the request timeout provided by the goose configuration
@Lol4t0 Lol4t0 changed the title Fix unicode processing +   support Fixed unicode handling, Python 3 support, Request as network backend, better content root extraction and other awesome features Jan 23, 2016
Swallowing errors makes it difficult to understand whether something went wrong with the network, goose, or the target resource.

So a strict mode (now the default) is introduced. In this mode goose raises an exception instead of returning empty responses.
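
The difference between the two modes can be illustrated with a small sketch. `FetchError`, `fetch_html`, and the `strict` flag name are hypothetical stand-ins, and the fetcher is passed in as a plain callable so the example runs without network access.

```python
class FetchError(Exception):
    """Hypothetical error type raised when a page cannot be fetched."""


def fetch_html(fetcher, url, strict=True):
    """Fetch a page through `fetcher`.

    In strict mode, failures propagate as FetchError.
    In non-strict mode they are swallowed and an empty string is returned.
    """
    try:
        return fetcher(url)
    except Exception as exc:
        if strict:
            raise FetchError("could not fetch %s: %s" % (url, exc)) from exc
        return ""


def failing_fetcher(url):
    """Stand-in for a network call that always fails."""
    raise IOError("connection refused")


if __name__ == "__main__":
    # Non-strict: the caller cannot tell a failure from an empty page.
    print(repr(fetch_html(failing_fetcher, "http://example.invalid", strict=False)))
    # Strict: the failure is visible and carries its cause.
    try:
        fetch_html(failing_fetcher, "http://example.invalid")
    except FetchError as err:
        print("error:", err)
```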
@andreis
andreis commented Mar 15, 2016

@grangier please merge this, Python 3 compatibility would be great to have

@adityarustgi
@grangier +1 on merging this PR. Python3 support is really needed.

@sandeepsayone
@grainger Please merge, we are no longer using Python 2.x

@lababidi
FYI, I've produced a pypi package goose3 that can be found at https://github.com/goose3/goose3

I appreciate all the work that @grangier has done, but I really needed goose to work on python3. If you'd like to fix any bugs, tests, etc I'm more than happy to put in time to look at pull requests and merge them. Thank you.

Successfully merging this pull request may close these issues.

Russian articles are not extracted
7 participants