Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(url): canonicalize_url('http://您好.中国:80/') failed #98

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

hidva
Copy link

@hidva hidva commented Sep 14, 2017

expect: 'http://xn--5usr0o.xn--fiqs8s:80/'
actual: 'http://xn--5usr0o.xn--:80-u68dy61b/'

w3lib/url.py Outdated Show resolved Hide resolved
@codecov
Copy link

codecov bot commented Sep 14, 2017

Codecov Report

Merging #98 into master will increase coverage by 0.73%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master      #98      +/-   ##
==========================================
+ Coverage   94.88%   95.61%   +0.73%     
==========================================
  Files           7        7              
  Lines         469      593     +124     
  Branches       95      132      +37     
==========================================
+ Hits          445      567     +122     
- Misses         16       17       +1     
- Partials        8        9       +1
Impacted Files Coverage Δ
w3lib/url.py 97.21% <100%> (+0.32%) ⬆️
w3lib/encoding.py 100% <0%> (ø) ⬆️

@hidva
Copy link
Author

hidva commented Sep 14, 2017

ERROR: pypy: InterpreterNotFound: pypy

@redapple
Copy link
Contributor

redapple commented Oct 5, 2017

A good catch @pp-qq !
This also affects safe_url_string() (and would be fixed too with your change to _safe_ParseResult())
This change needs an update to the tests. Could you add 1 or 2 for this case?

w3lib/url.py Outdated
@@ -374,7 +374,15 @@ def _safe_ParseResult(parts, encoding='utf8', path_encoding='utf8'):
# IDNA encoding can fail for too long labels (>63 characters)
# or missing labels (e.g. http://.example.com)
try:
netloc = parts.netloc.encode('idna')
idx = parts.netloc.rfind(u':')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think of using .hostname and .port attributes from the SplitResult instead of looking for a port number in netloc?

@redapple
Copy link
Contributor

redapple commented Oct 5, 2017

@pp-qq , the issue with pypy on Travis is unrelated. I'm trying to fix it in #99

@hidva
Copy link
Author

hidva commented Oct 19, 2017

ERROR: pypy: InterpreterNotFound: pypy
ERROR: The command "pip install -U tox twine wheel codecov" failed and exited with 2 during .

:rtype: unicode
"""
try:
idx = onetloc.rfind(u':')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the tests.
My #98 (comment) still holds though.
Have you considered using SplitResult .hostname and .port attributes?

Copy link
Author

@hidva hidva Oct 28, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Current behavior:

# Calling this function on an already "safe" URL will return the URL unmodified.
>>> w3lib.url.safe_url_string('https://www.baIdu.com')
'https://www.baIdu.com'  # I
>>> w3lib.url.canonicalize_url('http://www.baidu.com:80000000')
'http://www.baidu.com:80000000/'

when _encode_netloc() using SplitResult .hostname and .port attributes:

>>> w3lib.url.safe_url_string('https://www.baIdu.com')
'https://www.baidu.com'
>>> w3lib.url.canonicalize_url('http://www.baidu.com:80000000')
'http://www.baidu.com/'

because python 3.5 says:

hostname is Host name (lower case)...

Is this what we want?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with lowercasing the hostname. But it is indeed a change in behavior.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

>>> w3lib.url.canonicalize_url('http://www.baidu.com:80000000')
'http://www.baidu.com/'

what's the reason for the "lost" port number?

Copy link
Author

@hidva hidva Oct 28, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

python 3.5 says: port is None when value if not valid:

Python 2.7.12 (default, Nov 19 2016, 06:48:10) 
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from six.moves.urllib.parse import (urljoin, urlsplit, urlunsplit,
...                                     urldefrag, urlencode, urlparse,
...                                     quote, parse_qs, parse_qsl,
...                                     ParseResult, unquote, urlunparse)
>>> 
>>> t = urlparse('http://www.BaidU.com:80')  # valid port
>>> t.hostname, t.port
('www.baidu.com', 80)
>>> t = urlparse('http://www.BaidU.com:65536')  # invalid port
>>> t.hostname, t.port
('www.baidu.com', None)

And the new _encode_netloc():

def _encode_netloc(parts):
    """
    :type parts: ParseResult
    :rtype: unicode
    """
    try:
        hostname = to_unicode(parts.hostname.encode('idna'))
        netloc = hostname if parts.port is None else '%s:%s' % (hostname, parts.port)
        # lost the port number if parts.port is None
    except UnicodeError:
        netloc = parts.netloc
    return netloc

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a shame that urlsplit and urlparse in Python 2.7 and 3.5 simply swallow an invalid port number.
Python 3.6 does report an error:

$ python
Python 3.6.2 (default, Aug 24 2017, 10:48:24) 
[GCC 6.3.0 20170406] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from urllib.parse import urlsplit
>>> t = urlsplit('http://www.BaidU.com:654444')
>>> t.port
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.6/urllib/parse.py", line 169, in port
    raise ValueError("Port out of range 0-65535")
ValueError: Port out of range 0-65535

@Gallaecio Gallaecio closed this Aug 8, 2019
@Gallaecio Gallaecio reopened this Aug 8, 2019
@yozachar
Copy link

Bumping to close outdated PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants