Avoid truncating a pattern in the middle of a UTF-8 character #1807

Merged
b4n merged 7 commits into universal-ctags:master from b4n:utf8-pattern-truncation on Aug 6, 2018

b4n
Member

@b4n b4n commented Jul 30, 2018

Fixes #1275, fixes #1805.

@coveralls

coveralls commented Jul 30, 2018

Coverage Status

Coverage increased (+0.003%) to 84.415% when pulling 22b51fa on b4n:utf8-pattern-truncation into 3522685 on universal-ctags:master.

@b4n b4n force-pushed the utf8-pattern-truncation branch from 651ce41 to c971a39 Compare July 30, 2018 13:09
func95()func955,110
func97func979,219
func97(func979,219
Member Author

@masatake do we want to change this test case to compensate for the previous off-by-one behavior? I'm not sure, because it now properly checks the behavior when the line is exactly as long as, shorter than, and longer than the truncation limit, which seems like a good thing to do, but you might remember whether you had another goal here.

Member

It seems that a bug in newline handling has existed since Exuberant Ctags.
The line returned from readLineFromBypassAnyway can end with or without a newline character.
Both cases can occur, yet the code unconditionally trims the last character with line[len - 1] = '\0'.

Combined with your fix, I think the diff should be:

diff --git a/main/writer-etags.c b/main/writer-etags.c
index 762c8378..dbd6470f 100644
--- a/main/writer-etags.c
+++ b/main/writer-etags.c
@@ -98,18 +98,18 @@ static int writeEtagsEntry (tagWriter *writer,
                long seekValue;
                char *const line =
                                readLineFromBypassAnyway (etags->vLine, tag, &seekValue);
-               if (line == NULL)
+               if (line == NULL || line[0] == '\0')
                        return 0;
 
                len = strlen (line);
 
                if (tag->truncateLineAfterTag)
                        truncateTagLineAfterTag (line, tag->name, true);
-               else
-                       line [len - 1] = '\0';
+               else if (line [len - 1] == '\n')
+                       line [--len] = '\0';
 
                if (Option.patternLengthLimit < len)
-                       line [Option.patternLengthLimit - 1] = '\0';
+                       line [Option.patternLengthLimit] = '\0';
 
                length = mio_printf (mio, "%s\177%s\001%lu,%ld\n", line,
                                tag->name, tag->lineNumber, seekValue);
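
As a rough standalone illustration of the two adjustments in the diff above (strip the last byte only when it is a newline, and keep exactly the limit's worth of bytes rather than one fewer), here is a minimal sketch. The helper name trim_and_truncate and its limit parameter are hypothetical and not part of ctags; the sketch treats the limit as a plain byte count and does not model any special values the real option may have.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not ctags code: strip at most one trailing
 * newline, then cut the line to at most `limit` bytes.
 * Returns the resulting length in bytes. */
static size_t trim_and_truncate(char *line, size_t limit)
{
    assert(line != NULL);
    size_t len = strlen(line);

    /* An empty line has no last character to inspect. */
    if (len == 0)
        return 0;

    /* Remove the last byte only if it really is a newline. */
    if (line[len - 1] == '\n')
        line[--len] = '\0';

    /* Keep exactly `limit` bytes, not `limit - 1`. */
    if (limit < len) {
        line[limit] = '\0';
        len = limit;
    }
    return len;
}
```

With this shape, a line that is shorter than, exactly as long as, or longer than the limit is handled the same way whether or not it ends with a newline.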

@masatake
Member

@b4n, thank you very much. I will take a look this weekend.

@masatake
Member

masatake commented Aug 5, 2018

@b4n, thank you for working on this topic.
Could I ask you to add an explanation of the length adjustment for UTF-8 input to the "Truncating the pattern for long input lines" section in docs/news.rst?
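
For readers unfamiliar with why a length adjustment is needed for UTF-8 input: cutting a byte string at a fixed offset can land inside a multi-byte character and leave an invalid sequence behind. The sketch below shows one common way to avoid that, by backing the cut point up to the start of the character it would split. It is an illustration only and is not taken from this pull request's commits, which may adjust the length differently; the helper name utf8_truncate is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not ctags code: truncate `s` to at most `limit`
 * bytes without splitting a UTF-8 sequence. Returns the new length. */
static size_t utf8_truncate(char *s, size_t limit)
{
    assert(s != NULL);
    size_t len = strlen(s);
    if (len <= limit)
        return len;

    size_t cut = limit;
    size_t steps = 0;

    /* Continuation bytes have the form 10xxxxxx. If the first dropped
     * byte is one, the cut falls inside a character, so step back toward
     * its lead byte. A valid sequence has at most 3 continuation bytes. */
    while (cut > 0 && steps < 3 && ((unsigned char) s[cut] & 0xC0) == 0x80) {
        cut--;
        steps++;
    }

    /* Still on a continuation byte: the input is probably not valid
     * UTF-8, so give up and cut at the raw byte limit. */
    if (((unsigned char) s[cut] & 0xC0) == 0x80)
        cut = limit;

    s[cut] = '\0';
    return cut;
}
```

Backing up keeps the result within the byte limit at the cost of dropping the straddling character; extending the cut forward to the end of that character is the other obvious choice and slightly exceeds the limit instead.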

b4n added 2 commits August 5, 2018 16:54
Value returned by `readLineFromBypassAnyway()` can end with or without
a newline character.  Thus, make sure to handle both cases gracefully.

Patch by @masatake.
@b4n
Member Author

b4n commented Aug 5, 2018

@masatake Thanks for the review! I made the changes; tell me if this is what you were expecting.

@b4n b4n merged commit 22b51fa into universal-ctags:master Aug 6, 2018
b4n added a commit that referenced this pull request Aug 6, 2018
Avoid truncating a pattern in the middle of a UTF-8 character
Successfully merging this pull request may close these issues.

Universal ctags inserts invalid utf-8 characters for certain files
Unicode parse incorrect!
3 participants