PYTHON-API: `vim.eval` breaks, if the string `a:items` contains invalid `utf-8` characters #3213

alphaCTzo7G · 2018-07-15T17:22:10Z

Hi all,

I was trying to figure out why ctrlp-py-matcher is not working when certain
files are present in my repo.

I have this file in my folder:
https://github.com/pallets/jinja/blob/master/jinja2/_identifier.py

Now I try to use ctrlp, ctrlp-py-matcher to navigate tags. I use ctags to generate the tags files.

However, I found that even though the correct tags were being generated, :CtrlPTags wasn't showing any tags.

I have traced it down to the file above
https://github.com/pallets/jinja/blob/master/jinja2/_identifier.py

which contains UTF-8 header and possibly incorrect characters, vim.eval in the ctrlp-py-matcher fails.

Heres the line which fails
https://github.com/FelikZ/ctrlp-py-matcher/blob/cf63fd546f1e80dd4db3db96afbeaad301d21f13/autoload/pymatcher.py#L7

Heres the output of iconv run on the file:

iconv -f utf-8 -t ascii -o _identifier.py _identifier.py
iconv: illegal input sequence at position 65

Other people such as @ludovicchabant also found a similar problem and had to
work around it as shown in his fork of the ctrlp-py-matcher repo, and also
post process the tags file(https://ludovic.chabant.com/devblog/2017/02/25/aaa-gamedev-with-vim/) :

https://github.com/ludovicchabant/ctrlp-py-matcher/blob/2f6947480203b734b069e5d9f69ba440db6b4698/autoload/pymatcher.py#L22

If I use something like perl -i -pe 's/[^[:ascii:]]//g;', to remove any
non-ascii or non-utf-8 characters, then the tag file generated using ctags -R
starts working again.

Currently, I basically excluded the entire python env folder to prevent these
issues. I can ofcourse, do some post-processing of tags file after ctags
generates the tags file and remove the BOM header or invalid characters,
but it seems a bit hacky.

The best way would be if vim.eval('a:items') doesn't error out, even if its
argument contains BOM headers or invalid utf-8 characters, or if there
exists another function other than vim.eval which won't crash when the string
it contains has BOM headers.

Is vim.eval('a:items') supposed to crash if the items in the argument passed
to it (a:item) contains invalid characters?

Is there any way around it?

The text was updated successfully, but these errors were encountered:

brammool · 2018-07-16T10:21:52Z

[...]

Is `vim.eval('a:items')` supposed to crash if the items in the argument passed to it (`a:item`) contains invalid characters?

I believe that is the be expected, since it uses the argument as text, thus if the text is invalid it won't work.

Is there any way around it?

try/catch perhaps?

…

-- hundred-and-one symptoms of being an internet addict: 254. You wake up daily with your keyboard printed on your forehead. /// Bram Moolenaar -- Bram@Moolenaar.net -- http://www.Moolenaar.net \\\ /// sponsor Vim, vote for features -- http://www.Vim.org/sponsor/ \\\ \\\ an exciting new programming language -- http://www.Zimbu.org /// \\\ help me help AIDS victims -- http://ICCF-Holland.org ///

tonymec · 2018-07-16T10:53:12Z

OT1H, IMO crashes ought not no happen.
OTOH, in some multibyte encodings including UTF-8, not every byte sequence is valid.
So what is Vim to do if a string passed from Python (or Perl or…) as UTF-8 is actually not valid UTF-8? At the moment IIUC it crashes. Other possibilities could be (a) return some kind of error; (b) fall back on Latin1 (which, depending on the particular input, might either be the right thing to do or lead to mojibake); (c) replace the invalid sequence by one or more instances of the character � (U+FFFD REPLACEMENT CHARACTER) which is meant for exactly that purpose.

Best regards,
Tony.

alphaCTzo7G · 2018-07-16T16:28:51Z

@brammool..

try/catch perhaps?
yes @ludovicchabant 's fork of ctrlp-py-matcher does try.. except in python here: https://github.com/ludovicchabant/ctrlp-py-matcher/blob/2f6947480203b734b069e5d9f69ba440db6b4698/autoload/pymatcher.py#L10

to do exception handling.

Maybe, I am missing something.. but I haven't see any other way of evaluating a vim variable in python-namespace. Even if there is a single character that is invalid in the a:items it just returns an error.

@tonymec,

The first option you mentioned would be ok.. but will probably not help too much, as I haven't seen any other way of evaluating a vim variable in python. Even if I catch the error, I can't evaluate the string in python namespace, so even if there are efficient ways to remove the invalid character in python-name space(https://stackoverflow.com/questions/26541968/delete-every-non-utf-8-symbols-from-string), I can't do that.

The viml matcher in ctrlp works well even if there are invalid characters.. as I can do :CtrlPTag even with the invalid character.. however, its very very slow on large code bases. Thats where python API, ctrlp-py-matcher helps.. because I can use the default python functions to do the heavy lifting.. and why ctrlp-py-matcher is much faster than the same thing done in viml.

I wonder why vim.eval crashes and doesn't return anythng.. but vim can evaluate a:items if the python API is not used (as evidenced by the fact that :CtrlPTag without ctrlp-py-matcher still works, but is very slow)?

I have investigated the 3rd option, but outside VIM.. using perl-script.. but it tends to be error-prone. It would be great if vim.eval could delete the invalid character or replace it with what you suggested. It could be done in python as well: https://stackoverflow.com/questions/26541968/delete-every-non-utf-8-symbols-from-string.. but probably has to be done under the cover of vim.eval.

lilydjwg · 2018-07-16T16:46:14Z

Please don't fallback to latin1. I've seen too much of garbage caused by that.

There are several ways to handle this:

Use the errors='surrogateescape' option. You can encode back to the orignal string in this way. But some Python APIs may not expect surrogate pairs.
Use the errors='replace' option. Invalid characters are replaced by �. You lose information and can't convert back.
Provide a bytes version like Python do (e.g. os.environb etc. All paths in Python stdlib can be given as bytes too). This puts the burden to the developers using it.
Raise an error with a bytes version in the exception so cautious developers can still handle them but otherwise the error is explicitely shown.

lilydjwg · 2018-07-16T16:55:01Z

BTW, I don't know what exact problem @alphaCTzo7G has encountered. The _identifier.py file is valid UTF-8 but not valid ASCII. Maybe you are using Python 2 or a default encoding other than UTF-8?

alphaCTzo7G · 2018-07-16T18:46:27Z

@lilydjwg.. Interesting.. how are you determining that _identifier.py doesn't contain valid utf-8.

If it isn't then I am wondering what's your hypothesis behind why vim.eval('a:items') is failing if the contents of that file are contained in a:items?

alphaCTzo7G · 2018-07-16T18:46:37Z

There are mutliple other python libraries that have shown similar effects (if characters from these files are present in a:items, and there is a call to vim.eval(a:items'), vim.eval('a:items') crashes). I just showed one of them.. Perhaps the reason for these may be different, but there could be one root cause/commonality..

Here are more examples..

I did another test from here: https://stackoverflow.com/a/41741313/4752883

using grep -axv '.*' file.txt to see whether there are invalid characters.

This returns invalid character only for the file in #1. misc.html at the location <H1>Misc stuff I dont know where to put anywhere else</H1>

For iconv -f utf-8 -t ascii -o file1 file1

#1 (misc.html) returns error at : iconv: illegal input sequence at position 407
#2 (testing.py) returns error at iconv: illegal input sequence at position 716
#3 (_identifier.py) returns error at: `iconv: illegal input sequence at position 65

The iconv error seems to indicate that there is illegal input sequence at .., and I am using the -f utf-8 switch.. so doesn't that indicate that there is an incorrect utf-8 character at that location?

So the 2 tests above seem to give conflicting reports.

One hypothesis could be that vim.eval only takes ascii characters and fails if there are utf-8 characters not representably by ascii?

Note: I got the testing.py from pypi.. because I dont have hg installed, I couldn't validate that the testing.py from pypi using pip install pygments has or hasn't changed in the bitbucket repo.

alphaCTzo7G · 2018-07-16T18:51:36Z

just to provide some more details.. in case these are important..

I am on Ubuntu 16.04..

here's the output of locale

LANG=en_US.UTF-8
LANGUAGE=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=en_US.UTF-8

VIM - Vi IMproved 8.1 (2018 May 17, compiled Jul 1 2018 10:45:10)
Included patches: 1-42

lilydjwg · 2018-07-17T05:06:28Z

iconv -f utf-8 -t ascii -o file1 file1 tries to convert to ascii, which will fail if the input contains any non-ascii characters.

To tell if a file is in valid UTF-8, you can just open it in Vim with ++enc=utf-8, and see if Vim reports any errors (or use vim --clean if your vim config has troubles with the file).

You can get the file with cloning in both github and bitbucket: use the "raw" link on the page.

You can try to echo a:items out within VimL to see what it's like. Also, try Python 3 if you aren't using it already.

alphaCTzo7G · 2018-07-17T21:19:07Z

@lilydjwg .. thank you very much for your suggestions..I will go through them and test them.

Also..I am most likely using python3 and compiled my vim with python3 only (not dynamically linking python2 and python3 simultaneously) based on the instructions here: https://github.com/Valloric/YouCompleteMe/wiki/Building-Vim-from-source

alphaCTzo7G · 2018-07-20T18:58:40Z

@lilydjwg, you are right _identifier.py doesnt contain any illegal utf-8 characters. Using :e ++enc=utf-8, trying to navigate to the illegal character using 8g8 or using grep -axv .* doesnt show any errors.. so the file is probably fine.. but still vim.eval errors out for some reason..

misc.html from pywin32 does contain illegal utf-8 characters.. this is shown by 8g8, :e ++enc=utf-8 as well as grep:

I am indeed using python3 in my vim, as show below:

Regarding your question of file encoding, I have set encoding=utf-8, so set encoding returns encoding=utf-8. fileencoding is not set.. and from this post, it seems that if fileencoding is not set, it defaults to encoding: https://stackoverflow.com/a/16507826/4752883

alphaCTzo7G · 2018-07-21T21:39:41Z

@lilydjwg .. Is this what you meant?

I put echo a:items in the pymatcher#PyMatch function.. It seems to me that this is the function that gets the items.

However, it doesn't actually print out anything..

This fork of ctrlp-py-matcher does print out the incorrect characters: https://github.com/ludovicchabant/ctrlp-py-matcher/blob/2f6947480203b734b069e5d9f69ba440db6b4698/autoload/pymatcher.py#L22

And from this is seems to be the 204th character.. so I opened up the tags file navigated to the line containing the _identifier.py problem and used 204| where | is the pipe symbol to navigate to the 204th characters..

It even gets weirder.. I can't even do $ to navigate to the last character of the line

On line 211, I tried navigating to the last character and I was able to do that..

Then I tried to navigate to the last character using $ on the 212th line.. but couldn't do so.. even though there are characters left on the line.. I can't even navigate past the 65th character

lilydjwg · 2018-07-23T07:08:02Z

There is a corrupted character in your tags, and Vim has issues calculating the correct display width of the broken bytes.

You can report issues to the tool that has generated this tags file. (It seems to me that the tool clips long strings by bytes instead of characters.)

alphaCTzo7G · 2018-07-24T01:17:23Z

@lilydjwg .. actually not sure if its a problem with Universal ctags, (compiled a few days back.) there seems to be multiple issues here.. One was with the python-api..

I thought what I posted above was related but it seems the issue in this image

is related to the terminal I use (Putty), but is also present in Kitty, MobaXterm, BitVise ssh client as well tmux/tmux#1409 (comment)

If I try this in Gvim, or use a different terminal such as Gnome-terminal, or rxvt-unicode that has unicode support, this issue in the above image doesn't happen. So I will probably have to report these to them (Putty/Kitty etc).. Its not too much of a issue.. and probably a edge case, so maybe thats why they haven't bothered to fix it.

vim.eval still breaks for the 3 files I posted regardless of what terminal I use or if I use Gvim..

vim-ml · 2018-07-24T23:24:10Z

On Tue, Jul 24, 2018 at 4:17 AM alphaCTzo7G ***@***.***> wrote: @lilydjwg <https://github.com/lilydjwg> .. actually not sure if its a problem with Universal ctags, (compiled a few days back.) there seems to be multiple issues here.. One was with the python-api.. I thought what I posted above was related but it seems the issue in this image [image: image] <https://user-images.githubusercontent.com/29691914/43110872-305860be-8ea3-11e8-99b1-5c4a747bd40d.png> is related to the terminal I use (Putty), but is also present in Kitty, MobaXterm, BitVise ssh client as well tmux/tmux#1409 (comment) <tmux/tmux#1409 (comment)> If I try this in Gvim, or use a different terminal such as Gnome-terminal, or rxvt-unicode that has unicode support, this issue in the above image doesn't happen. So I will probably have to report these to them (Putty/Kitty etc).. Its not too much of a issue.. and probably a edge case, so maybe thats why they haven't bothered to fix it. vim.eval still breaks for the 3 files I posted regardless of what terminal I use or if I use Gvim..

`vim.bindeval` always uses `bytes` in place of `str` regardless of `&encoding` and Python version, so you can always choose what you to do with non-UTF-8 text. Though it is not without issues (or features which may cause issues): - For each container (list, dictionary) it will return not Python’s `list` or `dict`, but a binding to the container. That may need to be converted depending on where you pass that container. - `bytes` is harder to use in Python 3 as they are no longer meant to be text strings. You probably just not *can* choose in this case, but *must* choose each time. - Since Python is in external process in Neovim naturally Neovim does not support `bindeval`. (And should it support each access would invoke IPC which directly opposes the initial reason of `bindeval` existence - to make certain code faster.)

…

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <#3213 (comment)> -- -- You received this message from the "vim_dev" maillist. Do not top-post! Type your reply below the text you are replying to. For more information, visit http://www.vim.org/maillist.php --- You received this message because you are subscribed to the Google Groups "vim_dev" group. To unsubscribe from this group and stop receiving emails from it, send an email to ***@***.*** For more options, visit https://groups.google.com/d/optout.

alphaCTzo7G · 2018-07-30T02:12:40Z

There is a corrupted character in your tags, and Vim has issues calculating the correct display width of the broken bytes.

@lilydjwg, actually you are right about bringing up the tags file. Thanks for pointing this out.. for some reason, I didn't expect that even though the _identifier.py had all valid characters, ctags would insert invalid utf-8 characters.

While the picture I posted before regarding the location of the cursor is
related to the terminal, it seems that the current version of Ctags indeed
inserts a incorrect character into the file.

Here's some evidence. Note to isolate the problem, this is run a fresh Ubuntu
16.04 VM installed today. The only modifications made are to some libraries
necessary to compile ctags, vim from source.

With exuberant-ctags installed using just sudo apt-get install ctags:

With Universal-ctags compiled from source (latest commit) as of this post, compiled with instructions from here: https://github.com/universal-ctags/ctags/blob/master/docs/autotools.rst:

alphaCTzo7G · 2018-07-30T16:35:30Z

just thought I would update this..

@b4n has created a PR here to Universal-ctags: universal-ctags/ctags#1807 and it seems that ctags no longer cutting at an arbitrary byte position.

misc.html still doesn't work, as misc.html indeed contains a invalid utf-8 character, which ctags inserts into the tags file.. and thus vim.eval fails..

I will investigate the other option that that others proposed here..

k-takata · 2018-07-31T02:08:34Z

misc.html is encoded with Windows-1252 as it is described in the HTML header.
Try passing --input-encoding=Windows-1252 (or --input-encoding=cp1252) and --output-encoding=utf-8 to Universal Ctags.

alphaCTzo7G · 2018-07-31T04:22:05Z

@k-takata .. Thanks I have seen that.. unfortunately, I won't know what the encoding is for any given file for a random library that could have thousands of files. This solution may work for misc.html, but will will it work if there is another file with a different (non-utf-8) encoding? The files that I posted were mainly to narrow down the problem.

Based on what @b4n mentioned here: universal-ctags/ctags#1805 (comment)..

it seems the only way around this is to postprocess the tags file or if vim.eval doesn't break if it receives non-utf-8 encoded files/string or if ctags individually determines the type of each file and then creates a tag file with the utf-8 encoding? Do you have any better way for the general case where there are file with potentially multiple different types of encoding in the same repository?

chrisbra · 2018-08-23T09:02:13Z

I suppose this is not a Vim issue, so this can be closed, right?

alphaCTzo7G changed the title ~~vim.eval breaks, if the string a:items contains invalid utf-8 characters~~ PYTHON-API: vim.eval breaks, if the string a:items contains invalid utf-8 characters Jul 15, 2018

k-takata added the if_python label Jul 15, 2018

alphaCTzo7G mentioned this issue Jul 20, 2018

[Encoding] For certain files characters cross over the buffer boundaries #3237

Closed

alphaCTzo7G mentioned this issue Jul 30, 2018

Universal ctags inserts invalid utf-8 characters for certain files universal-ctags/ctags#1805

Closed

chrisbra closed this as completed Aug 24, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PYTHON-API: `vim.eval` breaks, if the string `a:items` contains invalid `utf-8` characters #3213

PYTHON-API: `vim.eval` breaks, if the string `a:items` contains invalid `utf-8` characters #3213

alphaCTzo7G commented Jul 15, 2018 •

edited

brammool commented Jul 16, 2018 via email

tonymec commented Jul 16, 2018 •

edited

alphaCTzo7G commented Jul 16, 2018

lilydjwg commented Jul 16, 2018 •

edited

lilydjwg commented Jul 16, 2018

alphaCTzo7G commented Jul 16, 2018

alphaCTzo7G commented Jul 16, 2018 •

edited

alphaCTzo7G commented Jul 16, 2018

lilydjwg commented Jul 17, 2018

alphaCTzo7G commented Jul 17, 2018

alphaCTzo7G commented Jul 20, 2018

alphaCTzo7G commented Jul 21, 2018 •

edited

lilydjwg commented Jul 23, 2018

alphaCTzo7G commented Jul 24, 2018

vim-ml commented Jul 24, 2018 via email

alphaCTzo7G commented Jul 30, 2018

alphaCTzo7G commented Jul 30, 2018

k-takata commented Jul 31, 2018

alphaCTzo7G commented Jul 31, 2018

chrisbra commented Aug 23, 2018

PYTHON-API: vim.eval breaks, if the string a:items contains invalid utf-8 characters #3213

PYTHON-API: vim.eval breaks, if the string a:items contains invalid utf-8 characters #3213

Comments

alphaCTzo7G commented Jul 15, 2018 • edited

brammool commented Jul 16, 2018 via email

tonymec commented Jul 16, 2018 • edited

alphaCTzo7G commented Jul 16, 2018

lilydjwg commented Jul 16, 2018 • edited

lilydjwg commented Jul 16, 2018

alphaCTzo7G commented Jul 16, 2018

alphaCTzo7G commented Jul 16, 2018 • edited

alphaCTzo7G commented Jul 16, 2018

lilydjwg commented Jul 17, 2018

alphaCTzo7G commented Jul 17, 2018

alphaCTzo7G commented Jul 20, 2018

alphaCTzo7G commented Jul 21, 2018 • edited

lilydjwg commented Jul 23, 2018

alphaCTzo7G commented Jul 24, 2018

vim-ml commented Jul 24, 2018 via email

alphaCTzo7G commented Jul 30, 2018

alphaCTzo7G commented Jul 30, 2018

k-takata commented Jul 31, 2018

alphaCTzo7G commented Jul 31, 2018

chrisbra commented Aug 23, 2018

PYTHON-API: `vim.eval` breaks, if the string `a:items` contains invalid `utf-8` characters #3213

PYTHON-API: `vim.eval` breaks, if the string `a:items` contains invalid `utf-8` characters #3213

alphaCTzo7G commented Jul 15, 2018 •

edited

tonymec commented Jul 16, 2018 •

edited

lilydjwg commented Jul 16, 2018 •

edited

alphaCTzo7G commented Jul 16, 2018 •

edited

alphaCTzo7G commented Jul 21, 2018 •

edited