Universal ctags inserts invalid utf-8 characters for certain files #1805
Comments
Sounds like #1275 to me: the new Something like #1275 (comment) should probably be implemented to fix this.
@alphaCTzo7G see #1807, does that properly fix it for you?
@b4n, thanks for your quick response... On the file that I posted here I will try out this PR on my real system over the next few days to see if it works for my whole repositories or emits other errors As There was another file which I found was causing problems, with Does it make sense for
IIUC, ctags (Exuberant ctags, I mean, which is only one of the ctags programs available) is distributed separately from Vim (even if its author knows Bram and even if they occasionally work together to make Vim and ctags work better together). From a ctags point of view, it is legitimate to treat program text as just strings of bytes: regardless of whether it is UTF-8, Latin1, Latin9 or some other ISO 8859 charset, a space is 0x20, a hard tab is 0x09, a line break is 0x0A possibly preceded by 0x0D, etc.; and a null byte, which would be 0x00, should not appear in a text file. Ctags treats every program in the same way regardless of which ASCII-compatible encoding it is written in, and therefore it doesn't need to care about which is which. Only for some outlandish charsets like EBCDIC does it need to treat the text as definitely non-ASCII (in EBCDIC, IIRC, A-I are 0xC1-0xC9, J-R are 0xD1-0xD9, S-Z are 0xE2-0xE9, 0-9 are 0xF0-0xF9, and I don't remember what the codes are for a space, a tab, a line break, a dash, an underscore, etc.; but you see that from an ASCII viewpoint it is really outlandish). IMHO, in ctags' case, the good old principle applies: garbage in, garbage out. Best regards,
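The byte-level argument above can be checked with a short Python sketch. It is only an illustration, not anything ctags itself does: the helper names are hypothetical, and it assumes Python's `cp037` codec as a stand-in for EBCDIC (one of several EBCDIC variants).

```python
# Characters that a byte-oriented scanner treats as structural.
STRUCTURAL = {"space": " ", "tab": "\t", "line feed": "\n", "carriage return": "\r"}

# Encodings named in the comment above, plus cp037 as an assumed EBCDIC variant.
ENCODINGS = ["ascii", "utf-8", "iso8859-1", "iso8859-15", "cp037"]


def encoded_bytes(encoding):
    """Map each structural character name to its byte value in `encoding`."""
    return {name: ch.encode(encoding)[0] for name, ch in STRUCTURAL.items()}


def is_ascii_compatible(encoding):
    """True if every code point 0-127 encodes to the identical single byte."""
    try:
        return all(chr(i).encode(encoding) == bytes([i]) for i in range(128))
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    for enc in ENCODINGS:
        print(enc, is_ascii_compatible(enc), encoded_bytes(enc))
```

For the ASCII-compatible encodings the structural bytes come out identical, which is why a scanner can ignore which one is in use. Under `cp037` they differ, and the letter `A` encodes to 0xC1, matching the EBCDIC range recalled above.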
@tonymec .. makes sense.. I realize that there may be other tag generation programs, but So I am wondering if these 2 might work or if you have any other ideas of how to handle files that have illegal
This is explained in section 1.3.4 of https://media.readthedocs.org/pdf/ctags/latest/ctags.pdf:
@alphaCTzo7G I agree with @tonymec and his conclusion. Unfortunately, it's a lot of trouble recognizing the proper encoding -- and I insist on proper, because it's easy to find an encoding in which the input would be technically valid, say most if not all 8bit encodings would, but knowing whether it's the right one is tricky or impossible: say, how can one be sure between e.g. ISO 8859-1 and 8859-15? Solutions include complex heuristics about usage frequency and context; or a more naive idea applicable to some languages like HTML would be extracting the encoding statement inside the file, but that can be incorrect just as well. Also, ctags stands in a difficult position here: many, if not most, consumers don't handle encodings, and generated tags need to match at the byte level. For example, grepping for a tag pattern or even name won't convert encodings for you, so the tag should match the file at the byte level. It was easy when all we had to care about was ASCII, but we're not so lucky anymore… UTF-8 didn't get adopted early enough. However, if you're happy with replacing invalid UTF-8 with U+FFFD or stripping it, maybe you could simply post-process ctags' output?
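The post-processing suggested above could look roughly like the following Python sketch. It is a minimal illustration, not part of ctags: the function names `sanitize_tags` and `sanitize_file` are hypothetical, and it assumes the tags file is small enough to read into memory.

```python
def sanitize_tags(data: bytes, strip: bool = False) -> bytes:
    """Return `data` as valid UTF-8.

    Invalid byte sequences are replaced with U+FFFD, or dropped
    entirely when `strip` is true. Valid input is returned unchanged.
    """
    errors = "ignore" if strip else "replace"
    return data.decode("utf-8", errors=errors).encode("utf-8")


def sanitize_file(src: str, dst: str, strip: bool = False) -> None:
    """Read the tags file at `src` and write a sanitized copy to `dst`."""
    with open(src, "rb") as f:
        data = f.read()
    with open(dst, "wb") as f:
        f.write(sanitize_tags(data, strip))
```

A call such as `sanitize_file("tags", "tags.clean")` would write a copy that decodes cleanly. The caveat from the comment above still applies: any tag whose pattern had bytes replaced or stripped no longer matches the source file at the byte level.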
@b4n, appreciate your comment. I actually deal mostly with I use To detect the encoding of the file can't you use the underlying libraries behind one of these options: https://stackoverflow.com/questions/805418/how-to-find-encoding-of-a-file-in-unix-via-scripts such as I will try out the
(Thank you for contacting us. If you are reporting an issue with the parsing output, please fill the following template. As your custom CTags configuration can affect results, please always use --options=NONE as the first option when running ctags. Otherwise, delete the template and write your issue from scratch. Examples may help developers understand your issue better. Use GitHub web interface and markdown notation. Using mail results in broken text rendering that makes the developers go crazy.)
The name of the parser:
The command line you used to run ctags:
I don't have any special configuration in .ctags or anywhere else. This is a fresh VM that this test was run on.
The content of input file: https://github.com/pallets/jinja/blob/master/jinja2/_identifier.py
The tags output you are not satisfied with:
Universal-ctags inserts invalid utf-8 characters under certain circumstances.
The tags output you expect:
Expected tag output with all valid utf-8 characters.
The version of ctags:
How do you get ctags binary:
(The ctags binary is built on an ubuntu-16.04 VM with no modifications other than installing necessary libraries such as automake, autoreconf for compiling ctags and necessary libraries to compile vim based on https://github.com/Valloric/YouCompleteMe/wiki/Building-Vim-from-source#a-for-a-debian-like-linux-distribution-like-ubuntu-type)
@lilydjwg pointed out to me that ctags was inserting invalid utf-8 characters even though the file being used to generate the tags has all valid utf-8 characters here: vim/vim#3213 (comment)
The compiled version of ctags works great in general.
I recently found out that ctags has a bug: the old Exuberant ctags installed by sudo apt-get install ctags on Ubuntu 16.04 doesn't insert any invalid utf-8 characters, but if I compile Universal-ctags from source and install it following the instructions at https://github.com/universal-ctags/ctags/blob/master/docs/autotools.rst, it inserts invalid utf-8 characters. Here's the evidence:
With exuberant-ctags installed using just sudo apt-get install ctags:
With Universal-ctags compiled from source (latest commit as of this post), following the instructions at https://github.com/universal-ctags/ctags/blob/master/docs/autotools.rst:
This causes a lot of problems in vim, because if invalid utf-8 characters are passed to vim.eval, vim.eval breaks and no tags are returned at all. Currently, there is only one way of transferring data contained in a viml variable to the Python namespace: using vim.eval. So any other plugin in vim or elsewhere will have similar problems as well. @ludovicchabant, for example, had to post-process his tags file to stop such problems:
https://ludovic.chabant.com/devblog/2017/02/25/aaa-gamedev-with-vim/
He also had to change ctrlp-py-matcher to catch this issue: https://github.com/ludovicchabant/ctrlp-py-matcher/blob/2f6947480203b734b069e5d9f69ba440db6b4698/autoload/pymatcher.py#L22
There are multiple other files I have seen which have similar problems, but I
have just provided one here to narrow down the problem.
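To narrow down which tags are affected in other files, a small scan of the generated tags file may help. The following Python sketch is only an assumption about how one might do that; the names `invalid_utf8_lines` and `report` are hypothetical and it is not taken from ctags or from any of the plugins mentioned.

```python
def invalid_utf8_lines(data: bytes):
    """Yield (line_number, raw_line) for each line that is not valid UTF-8.

    Splitting on the 0x0A byte is safe here because that byte never
    occurs inside a multi-byte UTF-8 sequence.
    """
    for number, line in enumerate(data.split(b"\n"), start=1):
        try:
            line.decode("utf-8")
        except UnicodeDecodeError:
            yield number, line


def report(path: str):
    """Return the offending lines of the tags file at `path` as a list."""
    with open(path, "rb") as f:
        return list(invalid_utf8_lines(f.read()))
```

Calling `report("tags")` returns the line numbers and raw bytes of every tag entry that would fail a strict UTF-8 decode, which can then be traced back to the input file named in that entry.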
My guess is this is a bug, and I don't expect that ctags would do this by design. Can this be rectified, as this used to work fine in Exuberant Ctags, upon which Universal-ctags is based?
Ref: vim/vim#3213 (comment)