Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PYTHON-API: vim.eval breaks, if the string a:items contains invalid utf-8 characters #3213

Closed
alphaCTzo7G opened this issue Jul 15, 2018 · 20 comments

Comments

@alphaCTzo7G
Copy link

alphaCTzo7G commented Jul 15, 2018

Hi all,

I was trying to figure out why ctrlp-py-matcher is not working when certain
files are present in my repo.

I have this file in my folder:
https://github.com/pallets/jinja/blob/master/jinja2/_identifier.py

Now I try to use ctrlp, ctrlp-py-matcher to navigate tags. I use ctags to generate the tags files.

However, I found that even though the correct tags were being generated, :CtrlPTags wasn't showing any tags.

I have traced it down to the file above
https://github.com/pallets/jinja/blob/master/jinja2/_identifier.py

which contains UTF-8 header and possibly incorrect characters, vim.eval in the ctrlp-py-matcher fails.

Heres the line which fails
https://github.com/FelikZ/ctrlp-py-matcher/blob/cf63fd546f1e80dd4db3db96afbeaad301d21f13/autoload/pymatcher.py#L7

Heres the output of iconv run on the file:

iconv -f utf-8 -t ascii -o _identifier.py _identifier.py
iconv: illegal input sequence at position 65

Other people such as @ludovicchabant also found a similar problem and had to
work around it as shown in his fork of the ctrlp-py-matcher repo, and also
post process the tags file(https://ludovic.chabant.com/devblog/2017/02/25/aaa-gamedev-with-vim/) :

https://github.com/ludovicchabant/ctrlp-py-matcher/blob/2f6947480203b734b069e5d9f69ba440db6b4698/autoload/pymatcher.py#L22

If I use something like perl -i -pe 's/[^[:ascii:]]//g;', to remove any
non-ascii or non-utf-8 characters, then the tag file generated using ctags -R
starts working again.

Currently, I basically excluded the entire python env folder to prevent these
issues. I can ofcourse, do some post-processing of tags file after ctags
generates the tags file and remove the BOM header or invalid characters,
but it seems a bit hacky.

The best way would be if vim.eval('a:items') doesn't error out, even if its
argument contains BOM headers or invalid utf-8 characters, or if there
exists another function other than vim.eval which won't crash when the string
it contains has BOM headers.

Is vim.eval('a:items') supposed to crash if the items in the argument passed
to it (a:item) contains invalid characters?

Is there any way around it?

@alphaCTzo7G alphaCTzo7G changed the title vim.eval breaks, if the string a:items contains invalid utf-8 characters PYTHON-API: vim.eval breaks, if the string a:items contains invalid utf-8 characters Jul 15, 2018
@brammool
Copy link
Contributor

brammool commented Jul 16, 2018 via email

@tonymec
Copy link

tonymec commented Jul 16, 2018

OT1H, IMO crashes ought not no happen.
OTOH, in some multibyte encodings including UTF-8, not every byte sequence is valid.
So what is Vim to do if a string passed from Python (or Perl or…) as UTF-8 is actually not valid UTF-8? At the moment IIUC it crashes. Other possibilities could be (a) return some kind of error; (b) fall back on Latin1 (which, depending on the particular input, might either be the right thing to do or lead to mojibake); (c) replace the invalid sequence by one or more instances of the character � (U+FFFD REPLACEMENT CHARACTER) which is meant for exactly that purpose.

Best regards,
Tony.

@alphaCTzo7G
Copy link
Author

@brammool..

try/catch perhaps?
yes @ludovicchabant 's fork of ctrlp-py-matcher does try.. except in python here: https://github.com/ludovicchabant/ctrlp-py-matcher/blob/2f6947480203b734b069e5d9f69ba440db6b4698/autoload/pymatcher.py#L10

to do exception handling.

Maybe, I am missing something.. but I haven't see any other way of evaluating a vim variable in python-namespace. Even if there is a single character that is invalid in the a:items it just returns an error.

@tonymec,

The first option you mentioned would be ok.. but will probably not help too much, as I haven't seen any other way of evaluating a vim variable in python. Even if I catch the error, I can't evaluate the string in python namespace, so even if there are efficient ways to remove the invalid character in python-name space(https://stackoverflow.com/questions/26541968/delete-every-non-utf-8-symbols-from-string), I can't do that.

The viml matcher in ctrlp works well even if there are invalid characters.. as I can do :CtrlPTag even with the invalid character.. however, its very very slow on large code bases. Thats where python API, ctrlp-py-matcher helps.. because I can use the default python functions to do the heavy lifting.. and why ctrlp-py-matcher is much faster than the same thing done in viml.

I wonder why vim.eval crashes and doesn't return anythng.. but vim can evaluate a:items if the python API is not used (as evidenced by the fact that :CtrlPTag without ctrlp-py-matcher still works, but is very slow)?

I have investigated the 3rd option, but outside VIM.. using perl-script.. but it tends to be error-prone. It would be great if vim.eval could delete the invalid character or replace it with what you suggested. It could be done in python as well: https://stackoverflow.com/questions/26541968/delete-every-non-utf-8-symbols-from-string.. but probably has to be done under the cover of vim.eval.

@lilydjwg
Copy link
Contributor

lilydjwg commented Jul 16, 2018

Please don't fallback to latin1. I've seen too much of garbage caused by that.

There are several ways to handle this:

  1. Use the errors='surrogateescape' option. You can encode back to the orignal string in this way. But some Python APIs may not expect surrogate pairs.
  2. Use the errors='replace' option. Invalid characters are replaced by �. You lose information and can't convert back.
  3. Provide a bytes version like Python do (e.g. os.environb etc. All paths in Python stdlib can be given as bytes too). This puts the burden to the developers using it.
  4. Raise an error with a bytes version in the exception so cautious developers can still handle them but otherwise the error is explicitely shown.

@lilydjwg
Copy link
Contributor

BTW, I don't know what exact problem @alphaCTzo7G has encountered. The _identifier.py file is valid UTF-8 but not valid ASCII. Maybe you are using Python 2 or a default encoding other than UTF-8?

@alphaCTzo7G
Copy link
Author

@lilydjwg.. Interesting.. how are you determining that _identifier.py doesn't contain valid utf-8.

If it isn't then I am wondering what's your hypothesis behind why vim.eval('a:items') is failing if the contents of that file are contained in a:items?

@alphaCTzo7G
Copy link
Author

alphaCTzo7G commented Jul 16, 2018

There are mutliple other python libraries that have shown similar effects (if characters from these files are present in a:items, and there is a call to vim.eval(a:items'), vim.eval('a:items') crashes). I just showed one of them.. Perhaps the reason for these may be different, but there could be one root cause/commonality..

Here are more examples..

  1. https://github.com/mhammond/pywin32/blob/master/com/win32com/HTML/misc.html
  2. https://bitbucket.org/birkenfeld/pygments-main/src/7941677dc77d4f2bf0bbd6140ade85a9454b8b80/pygments/lexers/testing.py?at=default&fileviewer=file-view-default
  3. https://github.com/pallets/jinja/blob/master/jinja2/_identifier.py

I did another test from here: https://stackoverflow.com/a/41741313/4752883

using grep -axv '.*' file.txt to see whether there are invalid characters.

This returns invalid character only for the file in #1. misc.html at the location <H1>Misc stuff I dont know where to put anywhere else</H1>

For iconv -f utf-8 -t ascii -o file1 file1

#1 (misc.html) returns error at : iconv: illegal input sequence at position 407
#2 (testing.py) returns error at iconv: illegal input sequence at position 716
#3 (_identifier.py) returns error at: `iconv: illegal input sequence at position 65

The iconv error seems to indicate that there is illegal input sequence at .., and I am using the -f utf-8 switch.. so doesn't that indicate that there is an incorrect utf-8 character at that location?

So the 2 tests above seem to give conflicting reports.

One hypothesis could be that vim.eval only takes ascii characters and fails if there are utf-8 characters not representably by ascii?

Note: I got the testing.py from pypi.. because I dont have hg installed, I couldn't validate that the testing.py from pypi using pip install pygments has or hasn't changed in the bitbucket repo.

@alphaCTzo7G
Copy link
Author

just to provide some more details.. in case these are important..

I am on Ubuntu 16.04..

here's the output of locale

LANG=en_US.UTF-8
LANGUAGE=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=en_US.UTF-8

VIM - Vi IMproved 8.1 (2018 May 17, compiled Jul 1 2018 10:45:10)
Included patches: 1-42

@lilydjwg
Copy link
Contributor

iconv -f utf-8 -t ascii -o file1 file1 tries to convert to ascii, which will fail if the input contains any non-ascii characters.

To tell if a file is in valid UTF-8, you can just open it in Vim with ++enc=utf-8, and see if Vim reports any errors (or use vim --clean if your vim config has troubles with the file).

You can get the file with cloning in both github and bitbucket: use the "raw" link on the page.

You can try to echo a:items out within VimL to see what it's like. Also, try Python 3 if you aren't using it already.

@alphaCTzo7G
Copy link
Author

@lilydjwg .. thank you very much for your suggestions..I will go through them and test them.

Also..I am most likely using python3 and compiled my vim with python3 only (not dynamically linking python2 and python3 simultaneously) based on the instructions here: https://github.com/Valloric/YouCompleteMe/wiki/Building-Vim-from-source

@alphaCTzo7G
Copy link
Author

@lilydjwg, you are right _identifier.py doesnt contain any illegal utf-8 characters. Using :e ++enc=utf-8, trying to navigate to the illegal character using 8g8 or using grep -axv .* doesnt show any errors.. so the file is probably fine.. but still vim.eval errors out for some reason..
2018-07-20_11-27-50

misc.html from pywin32 does contain illegal utf-8 characters.. this is shown by 8g8, :e ++enc=utf-8 as well as grep:
2018-07-20_11-21-29

I am indeed using python3 in my vim, as show below:
2018-07-20_11-38-18

Regarding your question of file encoding, I have set encoding=utf-8, so set encoding returns encoding=utf-8. fileencoding is not set.. and from this post, it seems that if fileencoding is not set, it defaults to encoding: https://stackoverflow.com/a/16507826/4752883

@alphaCTzo7G
Copy link
Author

alphaCTzo7G commented Jul 21, 2018

@lilydjwg .. Is this what you meant?

image

I put echo a:items in the pymatcher#PyMatch function.. It seems to me that this is the function that gets the items.

However, it doesn't actually print out anything..

This fork of ctrlp-py-matcher does print out the incorrect characters: https://github.com/ludovicchabant/ctrlp-py-matcher/blob/2f6947480203b734b069e5d9f69ba440db6b4698/autoload/pymatcher.py#L22

And from this is seems to be the 204th character.. so I opened up the tags file navigated to the line containing the _identifier.py problem and used 204| where | is the pipe symbol to navigate to the 204th characters..

It even gets weirder.. I can't even do $ to navigate to the last character of the line

On line 211, I tried navigating to the last character and I was able to do that..

Then I tried to navigate to the last character using $ on the 212th line.. but couldn't do so.. even though there are characters left on the line.. I can't even navigate past the 65th character
2018-07-21_14-50-29

@lilydjwg
Copy link
Contributor

There is a corrupted character in your tags, and Vim has issues calculating the correct display width of the broken bytes.

You can report issues to the tool that has generated this tags file. (It seems to me that the tool clips long strings by bytes instead of characters.)

@alphaCTzo7G
Copy link
Author

@lilydjwg .. actually not sure if its a problem with Universal ctags, (compiled a few days back.) there seems to be multiple issues here.. One was with the python-api..

I thought what I posted above was related but it seems the issue in this image
image

is related to the terminal I use (Putty), but is also present in Kitty, MobaXterm, BitVise ssh client as well tmux/tmux#1409 (comment)

If I try this in Gvim, or use a different terminal such as Gnome-terminal, or rxvt-unicode that has unicode support, this issue in the above image doesn't happen. So I will probably have to report these to them (Putty/Kitty etc).. Its not too much of a issue.. and probably a edge case, so maybe thats why they haven't bothered to fix it.

vim.eval still breaks for the 3 files I posted regardless of what terminal I use or if I use Gvim..

@vim-ml
Copy link

vim-ml commented Jul 24, 2018 via email

@alphaCTzo7G
Copy link
Author

There is a corrupted character in your tags, and Vim has issues calculating the correct display width of the broken bytes.

@lilydjwg, actually you are right about bringing up the tags file. Thanks for pointing this out.. for some reason, I didn't expect that even though the _identifier.py had all valid characters, ctags would insert invalid utf-8 characters.

While the picture I posted before regarding the location of the cursor is
related to the terminal, it seems that the current version of Ctags indeed
inserts a incorrect character into the file.

Here's some evidence. Note to isolate the problem, this is run a fresh Ubuntu
16.04 VM installed today. The only modifications made are to some libraries
necessary to compile ctags, vim from source.

With exuberant-ctags installed using just sudo apt-get install ctags:

2018-07-29_19-03-44

With Universal-ctags compiled from source (latest commit) as of this post, compiled with instructions from here: https://github.com/universal-ctags/ctags/blob/master/docs/autotools.rst:

2018-07-29_19-10-22

@alphaCTzo7G
Copy link
Author

just thought I would update this..

@b4n has created a PR here to Universal-ctags: universal-ctags/ctags#1807 and it seems that ctags no longer cutting at an arbitrary byte position.

misc.html still doesn't work, as misc.html indeed contains a invalid utf-8 character, which ctags inserts into the tags file.. and thus vim.eval fails..

I will investigate the other option that that others proposed here..

@k-takata
Copy link
Member

misc.html is encoded with Windows-1252 as it is described in the HTML header.
Try passing --input-encoding=Windows-1252 (or --input-encoding=cp1252) and --output-encoding=utf-8 to Universal Ctags.

@alphaCTzo7G
Copy link
Author

@k-takata .. Thanks I have seen that.. unfortunately, I won't know what the encoding is for any given file for a random library that could have thousands of files. This solution may work for misc.html, but will will it work if there is another file with a different (non-utf-8) encoding? The files that I posted were mainly to narrow down the problem.

Based on what @b4n mentioned here: universal-ctags/ctags#1805 (comment)..

it seems the only way around this is to postprocess the tags file or if vim.eval doesn't break if it receives non-utf-8 encoded files/string or if ctags individually determines the type of each file and then creates a tag file with the utf-8 encoding? Do you have any better way for the general case where there are file with potentially multiple different types of encoding in the same repository?

@chrisbra
Copy link
Member

I suppose this is not a Vim issue, so this can be closed, right?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants