Auto merge of #455 - puremourning:unicode-investigation, r=Valloric
[READY] Fix issues with multi-byte characters

## Summary

This change introduces more general support for non-ASCII characters in buffers handled by ycmd.

In ycmd's public API, all offsets are byte offsets into the UTF-8 encoded buffers. We also assume (because we have no other choice) that files stored on disk are also UTF-8 encoded.

Internally, almost all of ycmd's functionality operates on unicode strings (Python 2 `unicode()` and Python 3 `str()` objects, transparently via `future`). Many of the downstream completion engines expect unicode code points as the offsets in their APIs.

One special case is the `ycm_core` library (identifier completer and clang completer), which requires instances of the _native_ `str` type. All strings used within the C++ code via `boost::python` must pass through `ToCppStringCompatible`.

Previously, we were largely just assuming that `code point == byte offset`, i.e. that all buffers contained only ASCII characters. This worked up to a point, but more by luck than judgement in a number of places.
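To make the distinction between byte offsets and code point offsets concrete, here is a minimal Python 3 sketch of converting between the two for a single line. The function names are hypothetical and are not the helpers in `request_wrap.py`; the sketch also assumes 1-based offsets, which the description above does not state.

```python
# Illustrative sketch only: hypothetical helpers, assuming 1-based offsets
# and Python 3 str/bytes semantics.

def byte_offset_to_codepoint_offset(line, byte_offset):
    """Convert a 1-based byte offset into the UTF-8 encoding of `line`
    to a 1-based code point offset into `line`.

    Raises UnicodeDecodeError if `byte_offset` points into the middle
    of a multi-byte sequence.
    """
    encoded = line.encode('utf-8')
    return len(encoded[:byte_offset - 1].decode('utf-8')) + 1


def codepoint_offset_to_byte_offset(line, codepoint_offset):
    """Convert a 1-based code point offset into `line` to a 1-based
    byte offset into the UTF-8 encoding of `line`."""
    return len(line[:codepoint_offset - 1].encode('utf-8')) + 1


# U+00EF (i with diaeresis) occupies two bytes in UTF-8, so every
# character after it has a byte offset one larger than its code point offset.
line = 'na\u00efve = 1'
v_codepoint_offset = byte_offset_to_codepoint_offset(line, 5)
v_byte_offset = codepoint_offset_to_byte_offset(line, 4)
```

For a pure ASCII line both functions return their input unchanged, which is why the old `code point == byte offset` assumption appeared to work.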
## References

In combination with a YCM change and PR #453, I hope this:

- fixes #109
- fixes ycm-core/YouCompleteMe#2096
- fixes ycm-core/YouCompleteMe#2088
- fixes ycm-core/YouCompleteMe#2069
- fixes ycm-core/YouCompleteMe#2066
- fixes ycm-core/YouCompleteMe#1378

## Overview of changes

The changes fall into the following areas:

- Providing access to and conversion to/from code points and byte offsets (`request_wrap.py`)
- Changing certain algorithms/features to work entirely in codepoint space when they are trying to operate on logical 'characters' within the buffer (see known issues for why this isn't perfect, but probably most of the way there)
- Changing the completers to convert between the external (on both sides) and internal representations by using the shortcuts provided in `request_wrap.py`
- Adding tests for each of the completers for both completions and subcommands

## Completer-specific notes

Pretty much all of the completers I tested required some changes:

- clang uses UTF-8 and byte offsets, but had some bugs with the `GetDoc` parsing stuff
- OmniSharp speaks codepoint offsets
- Tern speaks codepoint offsets
- JediHTTP speaks codepoint offsets
- tsserver speaks codepoint offsets
- gocode speaks byte offsets
- racer I did not test

## Further work / Known issues

- we act blissfully ignorant of the case where a unicode character consumes multiple code points (such as where there is a modifier after the code point)
- when typing a unicode character, we still get an exception from `bitset` (see #453 for that fix)
- the filtering and sorting system is 100% designed for ASCII only, and it is not in the scope of this PR to change that. Currently after any filtering operation, words containing non-ASCII characters are excluded.
- I did not get round to testing Rust using racer
- there are further changes required to the YouCompleteMe client (a further PR is coming for that)
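The first known issue, a user-perceived character that spans more than one code point, can be illustrated with a short Python 3 sketch. This is not code from the change; `count_base_codepoints` is a hypothetical helper, and counting code points with combining class 0 is only a rough approximation of character segmentation, not a full grapheme cluster algorithm.

```python
import unicodedata

# The same visible character, written two ways.
composed = '\u00e9'       # single code point: e with acute accent
decomposed = 'e\u0301'    # 'e' followed by COMBINING ACUTE ACCENT

# Code point counts differ even though both render as one character.
composed_codepoints = len(composed)
decomposed_codepoints = len(decomposed)

# UTF-8 byte counts differ as well.
composed_bytes = len(composed.encode('utf-8'))
decomposed_bytes = len(decomposed.encode('utf-8'))


def count_base_codepoints(text):
    """Hypothetical helper: count code points that are not combining
    marks, as a rough stand-in for 'logical characters'."""
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)


# NFC normalization folds the decomposed form into the composed one.
normalized = unicodedata.normalize('NFC', decomposed)
```

Working in code point space therefore treats the decomposed form as two 'characters', which is the imperfection the known issue describes.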
Showing 44 changed files with 1,874 additions and 382 deletions.