Problems on Windows with username or hostname containing non-ASCII characters #3463

Closed
pekkaklarck opened this issue Feb 7, 2016 · 39 comments
Labels
auto-locked (Outdated issues that have been locked by automation); C: encoding (Related to text encoding and likely, UnicodeErrors); type: bug (A confirmed bug or unintended behavior)

Comments

@pekkaklarck
Contributor

pekkaklarck commented Feb 7, 2016

When organizing a Python training recently, one participant failed to use pip after a fresh Python 2.7.11 installation on Windows. A quick investigation showed that the problem was caused by ä in her username. We failed to work around that even by creating a new account and needed to use python setup.py install instead.

I now tried to reproduce the problem on my virtual machine. My main account there has only ASCII characters in the username but I created another for testing purposes. Clearly not everything works correctly:

C:\Users\Ürjö>pip
Traceback (most recent call last):
  File "c:\python27\lib\runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "c:\python27\lib\runpy.py", line 72, in _run_code
    exec code in run_globals
  File "C:\Python27\Scripts\pip.exe\__main__.py", line 9, in <module>
  File "c:\python27\lib\site-packages\pip\__init__.py", line 210, in main
    cmd_name, cmd_args = parseopts(args)
  File "c:\python27\lib\site-packages\pip\__init__.py", line 165, in parseopts
    parser.print_help()
  File "c:\python27\lib\optparse.py", line 1670, in print_help
    file.write(self.format_help().encode(encoding, "replace"))
  File "c:\python27\lib\optparse.py", line 1650, in format_help
    result.append(self.format_option_help(formatter))
  File "c:\python27\lib\optparse.py", line 1633, in format_option_help
    result.append(group.format_help(formatter))
  File "c:\python27\lib\optparse.py", line 1114, in format_help
    result += OptionContainer.format_help(self, formatter)
  File "c:\python27\lib\optparse.py", line 1085, in format_help
    result.append(self.format_option_help(formatter))
  File "c:\python27\lib\optparse.py", line 1074, in format_option_help
    result.append(formatter.format_option(option))
  File "c:\python27\lib\optparse.py", line 316, in format_option
    help_text = self.expand_default(option)
  File "c:\python27\lib\site-packages\pip\baseparser.py", line 112, in expand_default
    return optparse.IndentedHelpFormatter.expand_default(self, option)
  File "c:\python27\lib\optparse.py", line 288, in expand_default
    return option.help.replace(self.default_tag, str(default_value))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xdc' in position 9: ordinal not in range(128)

Interestingly installation and uninstallation seem to work fine also with this account. I guess the difference with the problem I saw earlier could be that my main user/admin doesn't have non-ASCII characters.


UPDATE: It later turned out that pip install is totally broken if the hostname has non-ASCII characters. That explains why creating a new account with just ASCII characters in the username didn't work when I first encountered this, and also why I couldn't reproduce that more severe problem with just an account with a non-ASCII username.

A workaround for both of these problems is using --no-cache-dir. Both problems are also fixed by PR #3970 that hopefully gets merged and released at some point.

@xavfernandez
Member

Hello, what is the pip version (pip --version)?

@xavfernandez xavfernandez added the C: encoding Related to text encoding and likely, UnicodeErrors label Feb 7, 2016
@pekkaklarck
Contributor Author

On my machine I got 8.0.2. The other setup I saw had a fresh Python 2.7.11 installation. Don't remember pip version, but it must have been the one bundled with the Python 2.7.11 Windows installer.

@xavfernandez xavfernandez added the type: bug A confirmed bug or unintended behavior label Feb 8, 2016
@piotr-dobrogost

In the past, a problem with non-ASCII usernames on Windows was raised in issue #1713. In that case the traceback also pointed to the optparse module.

@pekkaklarck
Contributor Author

Faced this again on a new training course. This time username had no non-ASCII characters, but Windows had Finnish locale and thus used C:\Käyttäjät\Name instead of C:\Users\Name. Still using Python 2.7.11 and installation was fresh.

This time our workaround was downloading the package and using pip install package-1.0.tar.gz. Less work than extracting the package and using python setup.py install, but required installing dependencies using the same approach first.

Would someone be interested to help me with the fix/PR if I try to debug why this actually happens? What timeline would we have to get the fix into Python 2.7.12? This is a very annoying bug and gives new users a bad first impression about pip and Python. It must be fixed in a version distributed with Python because the bug prevents upgrading pip itself.

@pfmoore
Member

pfmoore commented May 17, 2016

Pure speculation, but that sounds like it might be where pip is trying to put the download cache. Maybe as a workaround you could try using the --cache-dir option to ask pip to put the cache somewhere without non-ASCII characters? Even if it's no better than your current workaround, if it works it might help pinpoint the issue.

@pekkaklarck
Contributor Author

Yes, --cache-dir seems to be causing this. I debugged the problem on my machine (see the original description) and noticed that it crashes when formatting help text for exactly that option. Looking at the pip and optparse code I think I found the root cause:

  • pip sets default value for --cache-dir based on what pip.utils.appdirs.user_cache_dir returns. It is a Unicode string and in my case it is u'C:\\Users\\\xdcrj\xf6\\AppData\\Local\\pip\\Cache'.
  • optparse formats the default value in HelpFormatter.expand_default (line 288) and uses str(default_value) there.

Based on this analysis the real bug would be in optparse. Not sure what's the best way to avoid it on pip side.
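
To make the failing step concrete, here is a minimal sketch. It assumes that Python 2's str() on a Unicode object behaves like an explicit .encode('ascii'), which is spelled out here so the snippet also runs on Python 3. py2_style_str is a made-up name, not something from optparse or pip.

```python
# The default value pip hands to optparse on this machine (from the analysis above).
cache_dir = u'C:\\Users\\\xdcrj\xf6\\AppData\\Local\\pip\\Cache'


def py2_style_str(value):
    """Hypothetical stand-in for Python 2's str(unicode_value).

    Python 2 implicitly encodes with the ASCII codec; doing it explicitly
    reproduces the same failure on any Python version.
    """
    return value.encode('ascii')


try:
    py2_style_str(cache_dir)
    error = None
except UnicodeEncodeError as exc:
    # Keep the exception so its details can be inspected afterwards.
    error = exc
```

The captured error points at index 9, the Ü in the username, which matches the position reported in the traceback in the original description.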

@pfmoore
Member

pfmoore commented May 17, 2016

Yep, that analysis is exactly right. It's frustrating, because we don't even try to display the default value. It's also only an issue on Python 2 (of course... :-()

A bug report against cpython might be worth it. But it'd be nice to work around it in pip, as "upgrade to 2.7.12" seems like a bit of a heavy handed recommendation for this.

The only fix I can see within pip is to set the default for --cache-dir to None, and then after we've done option parsing, set the actual default if the user hasn't overridden it. But I'm not too familiar with pip's option system, so I don't know precisely where that change would need to go (and I don't have the time right now to dig into the code).
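
A rough sketch of that idea, assuming a plain optparse parser rather than pip's real option machinery. default_cache_dir and parse_with_late_default are hypothetical names, and the path is just the example from earlier in this thread:

```python
import optparse


def default_cache_dir():
    # Hypothetical stand-in for pip.utils.appdirs.user_cache_dir('pip').
    return u'C:\\Users\\\xdcrj\xf6\\AppData\\Local\\pip\\Cache'


parser = optparse.OptionParser(prog='pip')
parser.add_option(
    '--cache-dir',
    dest='cache_dir',
    metavar='dir',
    default=None,  # optparse never sees the non-ASCII path
    help='Store the cache data in <dir>.',
)


def parse_with_late_default(args):
    """Parse args, then fill in the real default if the user gave none."""
    options, remaining = parser.parse_args(args)
    if options.cache_dir is None:
        options.cache_dir = default_cache_dir()
    return options, remaining
```

With the default set to None, formatting the help text only ever stringifies optparse's own placeholder, so the non-ASCII path never reaches str().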

@pekkaklarck
Contributor Author

Just realized that it is likely that there are also other problems related to this. My students got errors when installing a package, not when showing help like I do, and I think their tracebacks were also different from what I get. Unfortunately I cannot reproduce that problem, nor did I ask them to send me tracebacks. I can try creating a new Windows virtual machine with the main user having a non-ASCII username and can also play with locale settings. We also continue the latest training next Monday and I can debug this on my student's machine then.

@pekkaklarck
Contributor Author

pekkaklarck commented May 17, 2016

Yeah, this isn't just a problem with optparse. I set up a new Windows 7 machine with main user Päivi, installed Python 2.7.11 and got the following traceback when running pip install robotframework.

Exception:
Traceback (most recent call last):
  File "c:\python27\lib\site-packages\pip\basecommand.py", line 211, in main
    status = self.run(options, args)
  File "c:\python27\lib\site-packages\pip\commands\install.py", line 294, in run
    requirement_set.prepare_files(finder)
  File "c:\python27\lib\site-packages\pip\req\req_set.py", line 334, in prepare_files
    functools.partial(self._prepare_file, finder))
  File "c:\python27\lib\site-packages\pip\req\req_set.py", line 321, in _walk_req_to_install
    more_reqs = handler(req_to_install)
  File "c:\python27\lib\site-packages\pip\req\req_set.py", line 461, in _prepare_file
    req_to_install.populate_link(finder, self.upgrade)
  File "c:\python27\lib\site-packages\pip\req\req_install.py", line 250, in populate_link
    self.link = finder.find_requirement(self, upgrade)
  File "c:\python27\lib\site-packages\pip\index.py", line 486, in find_requirement
    all_versions = self._find_all_versions(req.name)
  File "c:\python27\lib\site-packages\pip\index.py", line 404, in _find_all_versions
    index_locations = self._get_index_urls_locations(project_name)
  File "c:\python27\lib\site-packages\pip\index.py", line 378, in _get_index_urls_locations
    page = self._get_page(main_index_url)
  File "c:\python27\lib\site-packages\pip\index.py", line 818, in _get_page
    return HTMLPage.get_page(link, session=self.session)
  File "c:\python27\lib\site-packages\pip\index.py", line 928, in get_page
    "Cache-Control": "max-age=600",
  File "c:\python27\lib\site-packages\pip\_vendor\requests\sessions.py", line 477, in get
    return self.request('GET', url, **kwargs)
  File "c:\python27\lib\site-packages\pip\download.py", line 373, in request
    return super(PipSession, self).request(method, url, *args, **kwargs)
  File "c:\python27\lib\site-packages\pip\_vendor\requests\sessions.py", line 465, in request
    resp = self.send(prep, **send_kwargs)
  File "c:\python27\lib\site-packages\pip\_vendor\requests\sessions.py", line 605, in send
    r.content
  File "c:\python27\lib\site-packages\pip\_vendor\requests\models.py", line 750, in content
    self._content = bytes().join(self.iter_content(CONTENT_CHUNK_SIZE)) or bytes()
  File "c:\python27\lib\site-packages\pip\_vendor\requests\models.py", line 673, in generate
    for chunk in self.raw.stream(chunk_size, decode_content=True):
  File "c:\python27\lib\site-packages\pip\_vendor\requests\packages\urllib3\response.py", line 307, in stream
    data = self.read(amt=amt, decode_content=decode_content)
  File "c:\python27\lib\site-packages\pip\_vendor\requests\packages\urllib3\response.py", line 243, in read
    data = self._fp.read(amt)
  File "c:\python27\lib\site-packages\pip\_vendor\cachecontrol\filewrapper.py", line 54, in read
    self.__callback(self.__buf.getvalue())
  File "c:\python27\lib\site-packages\pip\_vendor\cachecontrol\controller.py", line 244, in cache_response
    self.serializer.dumps(request, response, body=body),
  File "c:\python27\lib\site-packages\pip\download.py", line 276, in set
    return super(SafeFileCache, self).set(*args, **kwargs)
  File "c:\python27\lib\site-packages\pip\_vendor\cachecontrol\caches\file_cache.py", line 99, in set
    with self.lock_class(name) as lock:
  File "c:\python27\lib\site-packages\pip\_vendor\lockfile\mkdirlockfile.py", line 18, in __init__
    LockBase.__init__(self, path, threaded, timeout)
  File "c:\python27\lib\site-packages\pip\_vendor\lockfile\__init__.py", line 189, in __init__
    hash(self.path)))
  File "c:\python27\lib\ntpath.py", line 85, in join
    result_path = result_path + p_path
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 1: ordinal not in range(128)

@pekkaklarck
Contributor Author

Also tested that using --no-cache-dir works as a workaround.

@pfmoore
Member

pfmoore commented May 17, 2016

OK, I suspect this is just symptomatic of the usual "Python 2.7 is rubbish at working with Unicode" situation. It's possible that changing _get_win_folder in pip.utils.appdirs to return a (byte) string under Python 2, encoded in a suitable encoding, would help, but I don't really know what encoding would be appropriate, nor do I know if that would simply introduce other problems.

This probably needs someone with Python 2/Unicode experience on Windows to look into it. I'm a Python 3 person myself, so I can't really offer much more. Sorry.

At least you have a workaround, which is something I guess...

@pekkaklarck
Contributor Author

pekkaklarck commented May 17, 2016

A little more debugging. The problem occurs in lockfile/__init__.py, line 189, in this code:

        self.unique_name = os.path.join(dirname,
                                        "%s%s.%s%s" % (self.hostname,
                                                       self.tname,
                                                       self.pid,
                                                       hash(self.path)))

The reason is that dirname is a Unicode string (created based on cache dir) and self.hostname is a byte string returned by socket.gethostname() and both contain non-ASCII characters:

u'c:\\users\\p\xe4ivi\\appdata\\local\\pip\\cache\\http\\3\\f\\b\\1\\f'
'P\xe4ivi-PC'

This would explain why the problem doesn't occur in my main Windows virtual machine which has only ASCII characters in its host name. Unfortunately the default machine name Windows creates is based on the user name (like 'Päivi-PC' in my case).
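
The mixing problem can be sketched like this. It assumes that Python 2 decodes the byte string with the ASCII codec when it is concatenated with a Unicode string, and the decode is written out explicitly so the snippet behaves the same on Python 3. The two values are the ones shown above.

```python
# Unicode path derived from the cache directory.
dirname = u'c:\\users\\p\xe4ivi\\appdata\\local\\pip\\cache\\http\\3\\f\\b\\1\\f'
# Byte string as returned by socket.gethostname() on Python 2.
hostname = b'P\xe4ivi-PC'

try:
    # Python 2 does this implicitly when evaluating unicode + bytes.
    hostname.decode('ascii')
    error = None
except UnicodeDecodeError as exc:
    # Keep the exception so its details can be inspected afterwards.
    error = exc
```

The captured error points at byte 0xe4 in position 1, the same byte and position as in the pip install traceback.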

@pekkaklarck
Contributor Author

Do you @pfmoore have any Windows/Python2/Unicode gurus in the pip team? You are right that --no-cache-dir is a great workaround when you know it, but it would be awesome if we could find a way to actually fix this.

@piotr-dobrogost

Fixing the lockfile lib might be difficult as it's deprecated and pip has a policy of bundling only released versions. The question is whether, after fixing lockfile, we could get another release of it.

From https://pypi.python.org/pypi/lockfile

Note: This package is deprecated. It is highly preferred that instead of
using this code base that instead fasteners_ or oslo.concurrency_ is
used instead. For any questions or comments or further help needed
please email openstack-dev_ and prefix your email subject
with [oslo][pylockfile] (for a faster response).

@pfmoore
Member

pfmoore commented May 18, 2016

@pekkaklarck On reflection, although this is triggered by Windows usernames with non-ASCII characters, I guess it's a platform-independent issue. So it's really just an issue of Python 2's Unicode model.

There seem to be multiple issues here though - the issue with help is because optparse tries to interpolate the default value of --cache-dir into the help (even though it's not displayed). The issue with lockfile is because the cache directory is Unicode. Both of these are problems in code outside of pip (stdlib and a vendored library respectively). So there's not a lot that I can see that we can do within pip.

@dstufft
Member

dstufft commented May 18, 2016

We should be able to fix the optparse issue by avoiding the default value, like @pfmoore said, I think?

The lockfile issue is trickier :( There's a replacement for lockfile but CacheControl can't use it yet, so it's likely that fixing it would require first updating CacheControl to use the new lockfile replacement, then possibly fixing that lockfile replacement if it has the same problem.

@pekkaklarck
Contributor Author

pekkaklarck commented May 18, 2016

Little more debugging and testing. It seems to me that pip expects paths to be byte strings on Python 2. At least both OSX code and general UNIX code in the user_cache_dir function always return bytes:

# OSX:
path = os.path.expanduser("~/Library/Caches")
# UNIX:
path = os.getenv("XDG_CACHE_HOME", os.path.expanduser("~/.cache"))

I would assume Windows code returning Unicode from the same method is a bug. I already tested that simply encoding the Windows path with the MBCS codec fixes both pip install and pip --help. This is very promising, but there are various issues to take into account:

  1. Is the same pip code used with Python 2 and 3? If yes, encoding should obviously be conditional.
  2. Is the same pip code used by other Python implementations? I know IronPython supports MBCS but Jython doesn't. What about PyPy and others? I'm not sure whether fixing this on all variants is worth the effort, but we obviously cannot break pip on them by using an unknown codec.
  3. There seem to be two implementations of _get_win_folder. The default one is based on ctypes and the other on querying the registry. Should both be tested/fixed? On my first quick test the alternative method seems to be totally broken...
  4. Is there any way to add tests for this?

@pfmoore
Member

pfmoore commented May 18, 2016

@pekkaklarck

I would assume Windows code returning Unicode from the same method is a bug.

I could at least as reasonably argue that the other code paths not returning Unicode is a bug (we should be using Unicode internally, not passing round values that may be one or may be another). But I will concede that returning different things depending on the OS is a bug. That same argument does mean that returning different things depending on the Python version is at least questionable, if not just as much of a bug, though.

Not having pip break for users is more important than arguing about Unicode purity, though. I just want to make sure we don't end up maintaining a fragile hack (and playing whack-a-mole with Unicode/str bugs throughout the code...)

To answer your questions:

  1. Yes, this code is used for both Python 2 and 3.
  2. There is nothing that would stop the code being used for other implementations. We test PyPy, but not IronPython or Jython. I don't know whether IronPython or Jython users typically use pip - I don't recall ever having seen any feedback from such users, so I suspect usage with those Python implementations is pretty minimal.
  3. I can't think of any situation where the ctypes implementation would not be used on CPython (ctypes is a non-optional module, AFAIK). But other implementations may well not provide ctypes, so the other implementation of _get_win_folder is needed (but may experience limited use).
  4. Tests of what? Of _get_win_folder would be hard - you're basically just testing Windows APIs (and whether it satisfies its own contract - which is currently "return the directory as Unicode", you're proposing changing that contract to "return Unicode on Python 3 and an encoded byte string on Python 2" which means a test wouldn't have caught an issue here). Of user_cache_dir returning edge case results, not so hard - you could simply monkeypatch that function to return whatever you wanted.

A question of my own - is it possible for user names, profile directories or whatever to not be encodable using the MBCS codec? You should probably use sys.getfilesystemencoding() rather than hard coding MBCS anyway, but the same question applies either way.
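
One way to probe that question is a small check like the following. This is only a sketch: is_encodable is a hypothetical helper, and cp1252 is used in the usage note purely as an example of a Western European Windows code page.

```python
import sys


def is_encodable(path, encoding=None):
    """Return True if the Unicode path survives encoding without errors.

    Falls back to sys.getfilesystemencoding() when no encoding is given.
    """
    encoding = encoding or sys.getfilesystemencoding()
    try:
        path.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True
```

For instance, is_encodable(u'P\xe4ivi', 'cp1252') is true, while a name written in Japanese characters is not encodable with cp1252, so a name that does not fit the chosen code page would still need some fallback.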

@dstufft
Member

dstufft commented May 18, 2016

I think the ideal situation would be for pip to be all unicode internally.

@pfmoore
Member

pfmoore commented May 18, 2016

I think the ideal situation would be for pip to be all unicode internally.

+1.

@pekkaklarck
Contributor Author

All Unicode internally is definitely an ideal situation, but I'm not sure that is a practical target at the moment:

  • We already know some dependencies don't support that.
  • It is possible that other similar problems are encountered if switching to Unicode.
  • Python 2 filesystem APIs in general work with bytes. Converting to Unicode just to convert back to bytes may not be worth the effort in general.
  • Switching to Unicode would require adding a lot of u'' prefixes (do we need to care about Py 3.2?) and possibly using different methods on Py 2 and 3 (os.getcwd/getcwdu).

Due to the above reasons, I believe pip should use "native strings" (i.e. str) both on Py 2 and 3.

@pekkaklarck
Contributor Author

AFAIK, the MBCS codec is somehow dynamic and Windows ends up using whatever actual encoding the system uses. If that's the case, using it with data returned by Windows APIs ought to be safe.

Both Jython and IronPython nowadays support pip, and at least we recommend using it in Robot Framework installation instructions also on those platforms. Not sure how widely pip actually is used on them, nor do I know what kind of changes could possibly break it. We can try summoning @jimbaker and @jdhardy here to comment.

@pekkaklarck
Contributor Author

Should I create a PR fixing this by encoding Unicode paths returned by Windows APIs to bytes using MBCS codec on Python 2? Should I try also encoding with ASCII if MBCS is not found? Or should ASCII be used first and MBCS only if that fails? If everything fails, should there be an error or should the Unicode string be left through and hope for the best?
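
To make the options concrete, one possible shape for such a helper is sketched below. The name windows_unicode_path_to_bytes and the codec order are assumptions for illustration only. It tries MBCS first, then ASCII, and lets the Unicode string through if nothing works.

```python
def windows_unicode_path_to_bytes(path, encodings=('mbcs', 'ascii')):
    """Hypothetical helper: encode a Unicode path with the first codec that works.

    LookupError covers interpreters or platforms without the codec (MBCS is
    only available on Windows); UnicodeEncodeError covers characters the
    codec cannot represent. If every codec fails, the path is returned
    unchanged.
    """
    for encoding in encodings:
        try:
            return path.encode(encoding)
        except (UnicodeEncodeError, LookupError):
            continue
    return path
```

On Python 2 the call would be guarded by a version check, since Python 3 code expects paths to stay as str.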

@pfmoore
Member

pfmoore commented May 20, 2016

Honestly, I'm not sure I'd want to take that approach. And without CI on Windows, having a Windows-specific change like that seems high-risk.

Maybe the first thing to do would be to submit a PR containing some tests that demonstrate this bug. That may require getting Appveyor testing set up (or some pretty major monkeypatching of sys.platform to exercise the code path on Unix...) That way, we'd have some means of being sure any fix works (and stays working) without needing someone on Windows with a non-ASCII username to check it out.

@pekkaklarck
Contributor Author

I don't think this can really be tested outside Windows. The current functionality uses Windows API or registry, and neither of them work elsewhere. They could be mocked but then we'd be testing the mock. The fix should probably be implemented in a separate helper method converting Unicode paths to bytes. That could be tested directly, but the implementation would use the MBCS codec that's only available on Windows.

@pfmoore
Member

pfmoore commented May 22, 2016

Due to the above reasons, I believe pip should use "native strings" (i.e. str) both on Py 2 and 3.

I'm strongly against this. It goes against every piece of advice I have seen (or given!) on how to handle Unicode, which is to use purely Unicode within your application, and convert to and from byte strings at well-defined "boundaries".

To respond to your individual points:

We already know some dependencies don't support that.

So we raise issues against those dependencies, or patch around the problem (i.e. treat the dependency as "outside the boundary" and convert to/from Unicode when interfacing to it).

It is possible that other similar problems are encountered if switching to Unicode.

Without evidence, this is pure speculation. General experience has been that a pure-Unicode model is far less likely to have problems, but we're both just making unsupported statements at this point.

Python 2 filesystem APIs in general work with bytes. Converting to Unicode just to convert back to bytes may not be worth the effort in general.

I'll not get into a Python 2 vs Python 3 debate here (suffice it to say that my view is that if you want proper support for non-ASCII data, you should use Python 3). But "converting to Unicode just to convert back" is essentially the definition of the "Unicode internally" strategy, and the benefit is basically "far fewer encoding bugs". So IMO, the benefit of going to Unicode is precisely the fixing of this (and probably a number of other) bugs with encodings. I consider that benefit worth the cost. You're claiming that a "native string internally" strategy can provide the same benefit. My experience says you're wrong - it tends to simply replace current bugs with different ones. But note that simply fixing one function isn't "native string internally" either - it's just patching over the issue for now and hoping there aren't other bugs elsewhere that will be triggered by the change.

Switching to Unicode would require adding a lot of u'' prefixes (do we need to care about Py 3.2?) and possibly using different methods on Py 2 and 3 (os.getcwd/getcwdu).

This is an approach that has been taken by a lot of projects, and as far as I know has never been an issue. Python 2/3 compatibility code is a fact of life, the fact that some of it is needed to cover conversion to Unicode isn't likely to be that much of an extra burden. But again, this feels like an unsubstantiated claim. The only way to know for sure is to actually try coding it.

By the way, I'll also point out that there's a big issue with your "native strings internally" proposal that you've not considered, specifically under Python 2 where a "native string" is a bytestring. Without knowing the encoding, a Python 2 string is meaningless, and strings can come into pip from a variety of sources. For example, the filesystem (sys.getfilesystemencoding()), the registry as here (native Unicode, so choose your encoding), or files such as requirements files (UTF-8). Do you propose normalising everything to a particular encoding (which basically means you're doing a by-hand version of Unicode everywhere) or are you willing to take the risk that 2 strings in different encodings are needed in the same piece of code? That's how we get encoding errors...

Anyway, I remain -1 on trying to force things to work by encoding stuff until bugs stop appearing. I'm +1 on long-term going to all-Unicode internally. I think that it's a bug for user_cache_dir to return a platform-specific type, but I want it to return Unicode on all platforms (because that's the option that doesn't lose information) - and I accept that means we still have to find a fix for the issues noted in this report. But I think that patches at the point where we send the data to the dependencies that can't handle Unicode is a better short-term approach.
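
As an illustration of what patching at the boundary could look like for the lockfile case, here is a sketch. unique_lock_name is a hypothetical function, not lockfile's actual API, and the caller is assumed to know which encoding the hostname bytes use.

```python
import ntpath


def unique_lock_name(dirname, hostname, tname, pid, path, encoding):
    """Hypothetical boundary helper: build the lock name from Unicode only.

    The hostname arrives as bytes (as socket.gethostname() returns on
    Python 2), so it is decoded once here, with an explicitly supplied
    encoding, before it is combined with the Unicode directory name.
    """
    if isinstance(hostname, bytes):
        hostname = hostname.decode(encoding)
    return ntpath.join(
        dirname,
        u"%s%s.%s%s" % (hostname, tname, pid, hash(path)),
    )
```

Everything past the decode step stays Unicode, so no implicit ASCII conversion is left for Python 2 to attempt.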

@pekkaklarck
Contributor Author

I'd like to separate the discussion about getting this really severe bug fixed ASAP from the discussion about how pip handles bytes/Unicode. The fact is that pip uses bytes internally on Python 2 and that seems to work pretty well except for this particular issue. I have demonstrated that the problem can be fixed easily, without resorting to hacks, and I'm willing to create a pull request. I'd obviously be happy if someone else is interested in changing pip to use Unicode internally also on Python 2 and fixes this issue along the way.

The only thing I really care about is that this issue gets fixed in the next Python 2.7.x release. Leaving us in the non-ASCII world to deal with such bugs, even when there is a simple fix available, would be very much against the practicality beats purity principle.

@pfmoore
Member

pfmoore commented May 23, 2016

@pekkaklarck Agreed, which is why I suggested a fix for the optparse issue, which was the original subject of this issue, and suggested converting the data before passing it to the lockfile module (which should probably have been raised as a separate issue, but that's a minor point).

I'm 100% in favour of getting pip to work with non-ASCII data. But we should be able to do that in a way that doesn't need to be ripped out and reworked when we move to the all-Unicode approach.

@pekkaklarck
Contributor Author

Had a chance to investigate this on my student's machine today. Findings:

  • --no-cache-dir avoids the problem with installation. A good workaround when you know it.
  • The problem with installation actually happens if the hostname has non-ASCII characters, regardless of whether the path to the cache directory has non-ASCII characters or not. lockfile gets the hostname using socket.gethostname(), which returns it as bytes, and joining that with the cache path passed in as Unicode obviously fails.
  • The problem with help text in optparse occurs if the cache path has non-ASCII characters, regardless of the hostname. It cannot be avoided with --no-cache-dir.

@pekkaklarck
Contributor Author

@pfmoore: If we add a helper method to encode the Unicode path returned by Windows APIs to bytes, it obviously needs to be removed if pip is later changed to use Unicode internally. That wouldn't be a big task, though. The helper method itself would be pretty simple and the changes to the current Windows code would be something like this:

if PY2 and isinstance(path, unicode):
    path = _windows_unicode_path_to_bytes(path)

Although I only really care about this issue being fixed, I'd like to put my 0,02€ in about Unicode vs bytes in this particular case. I fully agree that programs in general should use Unicode internally for representing text. In this case there are several reasons I doubt changing how pip handles paths is a good idea:

  • AFAIK, there is no reliable way to get the file system encoding in a platform-independent manner on Python 2.
  • POSIX doesn't enforce any encoding and file system paths are just bytes. In Python 3 a lot of effort has been put into making things like os.listdir work correctly (e.g. PEP-383). Is that worth the effort in pip on Python 2?
  • You shouldn't actually think of paths as strings in general (see PEP-519). In that regard, keeping paths as bytes when using Python 2 wouldn't even violate the principle of keeping all text as Unicode internally. INI files etc. could (and should) still be processed as Unicode.
  • As already discussed, various external modules do not work with Unicode.
  • Changing to Unicode internally would be a huge change, potentially causing a lot of problems. "If it ain't broke, don't fix it" and "Practicality beats purity" both warn against that.

@osnoser1

osnoser1 commented Aug 12, 2016

My temporary fix in optparse.py: 🙈
default_value.encode('utf-8') if isinstance(default_value, unicode) else str(default_value)

@pekkaklarck pekkaklarck changed the title Problems on Windows with username containing non-ASCII characters Problems on Windows with username or hostname containing non-ASCII characters Sep 14, 2016
@pekkaklarck
Contributor Author

Encountered this again in another training. This time the machine had hostname "Kotiläppäri" (home laptop in Finnish). Used the pip version included with Python 2.7.12.

pekkaklarck added a commit to pekkaklarck/pip that referenced this issue Sep 16, 2016
Two related problems are fixed:
- Previously non-ASCII characters in hostname blew up `pip install`
  completely. This is rather severe.
- Non-ASCII characters in username crashed printing help text. Not
  so bad but definitely annoying.

Non-ASCII usernames are pretty common in non-English speaking
countries. That also makes non-ASCII hostnames pretty common, because
Windows creates hostname based on the username by default.

The reason for these failures was that `user_cache_dir` returned
Unicode on Windows and bytes elsewhere, and rest of the codebase was
expecting paths to be bytes.

It could be argued that pip should always use Unicode internally, but
that would require a lot more changes and fixing also some of the
vendored dependencies. We can also argue, like PEP-519 does, that
paths should not be considered to be strings at all, and thus the "all
Unicode internally" guideline wouldn't apply in this case.

Fixes pypa#3463. See that issue for more discussion and details.
@pekkaklarck
Contributor Author

PR #3970 fixes both the more severe problem with pip install crashing if hostname has non-ASCII characters, and the less severe but annoying crash with pip --help if username has non-ASCII characters.

@Champal

Champal commented Sep 29, 2016

Hi,

I have spent 4 days looking for a solution to a quite simple problem: why doesn't pip work on my PC? "UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 5: ordinal not in range(128)"
Today I found this issue and your PR. The solution is so "small", and it works!
I hope your PR will be merged in the next version.

It's incredible that, all over the world, there are so few people whose names include a non-ASCII letter!
It's also incredible that in 2016 Python doesn't support Unicode correctly.

I did not like Python before, but some software uses Python, so I installed it :-(
Today I really don't like Python, and this confirms two things:
First: Python is a broken/weird/messy language.
Second: thanks to people like "pekkaklarck" for your tenacity. You tracked down this bug, convinced people here, and found the solution! You are a GREAT man, thanks.

The software (PlatformIO) that I want to use still fails at a later point, in another Python mystery:

File "C:\Users\STPHAN~1\AppData\Local\Temp\d-116829-10092-19pzswv.27458tcsor\virtualenv-14.0.6\virtualenv.py", line 783, in call_subprocess
    % (cmd_desc, proc.returncode))
OSError: Command C:\.pioidepenv\Scripts\python.exe -c "import sys, pip; sys...d\"] + sys.argv[1:]))" setuptools pip wheel failed with error code 2

C:\Users\Stéphane>virtualenv --version
14.0.6

C:\Users\Stéphane>virtualenv test
New python executable in C:\Users\StÚphane\test\Scripts\python.exe
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 11: ordinal not in range(128)
ERROR: The executable C:\Users\StÚphane\test\Scripts\python.exe is not functioning
ERROR: It thinks sys.prefix is u'c:\\users\\st\xe9phane' (should be u'c:\\users\\st\xe9phane\\test')
ERROR: virtualenv is not compatible with this system or executable
Note: some Windows users have reported this error when they installed Python for "Only this user" or have multiple versions of Python installed. Copying the appropriate PythonXX.dll to the virtualenv Scripts/ directory may fix this problem

This is REALLY FUNNY: "some Windows users", "may fix"!
Generally the answer on forums is: "reinstall and maybe it will automagically work!"

Sorry for this long post, but it's so frustrating that so much software uses Python when Python is so messy.

Again, thanks to pekkaklarck, you are making this "thing" less buggy.
For my part, I will avoid using Python and all software that uses Python. That's not easy, as more and more "developers" use Python.

"I have a dream that one day" serious developers will use a reliable language.
Stéphane.

@pekkaklarck
Contributor Author

@Champal To be fair to Python, this kind of problem shouldn't occur anymore in Python 3. The underlying issue here is mixing bytes and Unicode, which in Python 2 only causes problems if the username or hostname contains non-ASCII characters. On Python 3 you simply cannot mix bytes and Unicode like that.
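
A small sketch of that difference, written for Python 3. The function names are made up for this example; `py2_style_concat` spells out explicitly the ASCII decoding that Python 2 performed implicitly when bytes and Unicode were combined.

```python
def py2_style_concat(text, raw):
    """Mimic Python 2: decode the bytes as ASCII, then concatenate."""
    return text + raw.decode("ascii")


def py3_concat(text, raw):
    """Python 3 refuses to combine str and bytes at all."""
    return text + raw  # raises TypeError for any bytes value


def outcome(func, text, raw):
    """Return the result, or the name of the exception that was raised."""
    try:
        return func(text, raw)
    except (UnicodeDecodeError, TypeError) as exc:
        return type(exc).__name__


prefix = "C:\\Users\\"
print(outcome(py2_style_concat, prefix, b"Pekka"))        # ASCII only: works
print(outcome(py2_style_concat, prefix, b"St\xe9phane"))  # UnicodeDecodeError
print(outcome(py3_concat, prefix, b"Pekka"))              # TypeError
```

The Python 2 behaviour fails only for some inputs, which is why the bug stayed hidden on machines with ASCII-only names. Python 3 fails for every input, so the mistake is caught during development.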

The good news is that my PR got a positive review, and hopefully we will get the final issues resolved before the pip 8.2 release. Until then, using --no-cache-dir is a workaround.
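
For anyone scripting the workaround, a minimal sketch of building such a command line in Python. The helper `pip_install_command` and the package name `example-package` are placeholders invented for this example; only the `--no-cache-dir` option comes from the discussion above.

```python
import sys


def pip_install_command(package, use_cache=True):
    """Build an argument list for installing ``package`` with pip.

    Hypothetical helper for illustration; it only builds the command
    and does not run it.
    """
    cmd = [sys.executable, "-m", "pip", "install"]
    if not use_cache:
        # Skip the cache directory, whose path is derived from the username.
        cmd.append("--no-cache-dir")
    cmd.append(package)
    return cmd


print(pip_install_command("example-package", use_cache=False))
```

The resulting list can be passed to `subprocess.call` or a similar function to run the install.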

@Champal

Champal commented Sep 30, 2016

Your PR is good, but it only covers pip.
virtualenv has the same bug (UnicodeDecodeError), and I think virtualenv will never work correctly:
2013 : "Problem with non ASCII car in path" pypa/virtualenv#457
2014 : "support install into non-ASCII directories" pypa/virtualenv#912
PR pypa/virtualenv#900 seems to be the solution?

But with Python 3.5 there is also another problem: "virtualenv fails with Python 3.5 on Windows" pypa/virtualenv#796

Ahhhhhhhhhhhhh !

To return to the source of the problem: PlatformIO (http://platformio.org/).
They call it "The next-generation integrated development environment for IoT." ah ah ah!
PlatformIO initial commit: May 2014 (platformio/platformio-core@e0c4fb3)
2010: Python 2.7
2012: Python 3.3
The question is: why does a project started in 2014 use Python 2.7?
"PlatformIO depends on SCons" platformio/platformio-core#595

OK, let's go to SCons. http://scons.org/ "a next-generation build tool" Ah ah ah! (with Python 2.7!)
This page is funny: http://scons.org/tag/releases.html
It contains this phrase 7 times: "This will be the last release to support Python versions earlier than 2.7, as we begin to move toward supporting Python 3."
No dates on a release page, very useful.
Download page at SourceForge!!!!
At least release 2.5.0 (09 Apr 2016) announces future support for Python 3.

So I will wait for the upgrade to Python 3, and perhaps pypa/virtualenv#796 will be fixed before then... The suspense is unbearable... No, I'm joking, I don't care, I have no hope, it's just a distraction...

Good luck. I'm going away and will stop disturbing your thread.
Sorry,
Stéphane

pfmoore added a commit to pfmoore/pip that referenced this issue Oct 5, 2016
@pfmoore
Member

pfmoore commented Oct 6, 2016

PR #4000 now merged. Can this issue now be closed? There's a lot of discussion here about various Unicode-related problems, and I don't want to unilaterally close this in case there are other problems here that the PR didn't resolve.

@pekkaklarck if you're happy that the issue is now fixed, can you close it?

@pekkaklarck
Contributor Author

PR #4000 fixes the original problem, and I would say this issue can be closed. Should a milestone or some labels be set first?

@pfmoore
Member

pfmoore commented Oct 7, 2016

No, AIUI we use milestones for "we need to fix this in x.y" not for "this has been fixed in x.y". So I'll close this and just note here that the fix should appear in the next release (8.2)
