Avoid truncating a pattern in the middle of an UTF-8 character #1807

b4n · 2018-07-30T10:20:18Z

Fixes #1275, fixes #1805.

coveralls · 2018-07-30T10:27:57Z

Coverage increased (+0.003%) to 84.415% when pulling 22b51fa on b4n:utf8-pattern-truncation into 3522685 on universal-ctags:master.

Fixes universal-ctags#1275, fixes universal-ctags#1805.

`-1` was for stripping the newline, and has no reason to be repeated for truncation length as truncation occurs only if it would result in a strictly shorter string anyway (at most `len - 1` already).
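To make the off-by-one concrete, here is a minimal sketch in C. `truncate_at` is a hypothetical helper, not a function from the ctags sources; it only mirrors the indexing described above and assumes the trailing newline has already been stripped.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of ctags): keep at most `limit` bytes of `s`.
 * Truncation only happens when it yields a strictly shorter string, so the
 * terminator belongs at index `limit`, not `limit - 1`. */
static size_t truncate_at (char *s, size_t limit)
{
	size_t len = strlen (s);

	if (limit < len)
	{
		s[limit] = '\0';
		len = limit;
	}
	return len;
}
```

With a limit of 6, `"func97("` becomes `"func97"`. Writing the terminator at `limit - 1` would instead leave `"func9"`, one byte shorter than requested.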

b4n · 2018-07-30T13:15:03Z

Tmain/omit-long-patterns-etags.d/stdout-expected.txt

                                                                                               func95()func955,110
-                                                                                                 func97func979,219
+                                                                                                 func97(func979,219


@masatake do we wanna change this test case to balance the previous off-by-one-ness? I'm not sure, because it now properly checks the behavior when the line is exactly as long as, shorter than, and longer than the truncation limit, which seems a good thing to do, but you might remember whether you had another goal here.

It seems that a bug in newline handling has existed since Exuberant Ctags.
The line returned from readLineFromBypassAnyway can end with or without a newline character.
Both can occur, but the code trims the line with line[len - 1] = '\0' either way.

To combine your fix, I think the diff should be:

diff --git a/main/writer-etags.c b/main/writer-etags.c
index 762c8378..dbd6470f 100644
--- a/main/writer-etags.c
+++ b/main/writer-etags.c
@@ -98,18 +98,18 @@ static int writeEtagsEntry (tagWriter *writer,
         long seekValue;
         char *const line = readLineFromBypassAnyway (etags->vLine, tag, &seekValue);
-        if (line == NULL)
+        if (line == NULL || line[0] == '\0')
             return 0;
         len = strlen (line);
         if (tag->truncateLineAfterTag)
             truncateTagLineAfterTag (line, tag->name, true);
-        else
-            line [len - 1] = '\0';
+        else if (line [len - 1] == '\n')
+            line [--len] = '\0';
         if (Option.patternLengthLimit < len)
-            line [Option.patternLengthLimit - 1] = '\0';
+            line [Option.patternLengthLimit] = '\0';
         length = mio_printf (mio, "%s\177%s\001%lu,%ld\n", line, tag->name, tag->lineNumber, seekValue);
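The newline handling proposed in the diff above can be isolated into a small sketch. `strip_trailing_newline` is a hypothetical name used only for illustration, not a function in ctags. It removes the last byte only when that byte really is a newline, and it never indexes `s[len - 1]` on an empty string.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of ctags): remove one trailing '\n' if
 * present and return the resulting length. An empty string is left alone,
 * so s[len - 1] is never evaluated with len == 0. */
static size_t strip_trailing_newline (char *s)
{
	size_t len = strlen (s);

	if (len > 0 && s[len - 1] == '\n')
		s[--len] = '\0';
	return len;
}
```

A line such as `"int x;\n"` and a line such as `"int x;"` both end up as `"int x;"`, whereas an unconditional `line[len - 1] = '\0'` would drop the `;` from the second one.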

masatake · 2018-07-30T13:16:30Z

@b4n, thank you very much. I will take a look this weekend.

masatake · 2018-08-05T09:56:27Z

@b4n, thank you for working on this topic.
Can I ask you to add an explanation of the length adjustment for UTF-8 input to the "Truncating the pattern for long input lines" section in docs/news.rst?
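The length adjustment for UTF-8 input can be sketched as follows. This is an illustration under stated assumptions, not the code merged in this pull request: `utf8_truncate` is a hypothetical name, and the input is assumed to be valid UTF-8. The idea is to move the cut point back while it would land on a continuation byte (bit pattern `10xxxxxx`), so the result may be slightly shorter than the limit but never ends in a partial character.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of ctags): truncate `s` to at most `limit`
 * bytes without splitting a multi-byte UTF-8 sequence.
 * Assumes `s` holds valid UTF-8. Returns the resulting length in bytes. */
static size_t utf8_truncate (char *s, size_t limit)
{
	size_t len = strlen (s);

	if (limit >= len)
		return len;	/* already short enough, nothing to cut */

	/* s[limit] is the first byte to be dropped. If it is a continuation
	 * byte (10xxxxxx), the cut would fall inside a character, so step
	 * back until the cut sits on a lead byte or an ASCII byte. */
	while (limit > 0 && ((unsigned char) s[limit] & 0xC0) == 0x80)
		limit--;

	s[limit] = '\0';
	return limit;
}
```

For the four-byte string `"a\xC3\xA9z"` (an `a`, a two-byte `é`, then `z`), a limit of 2 would fall between the two bytes of `é`, so the sketch backs up and keeps only `"a"`. A limit of 3 falls on a character boundary and keeps `"a\xC3\xA9"`.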

Value returned by `readLineFromBypassAnyway()` can end with or without a newline character. Thus, make sure to handle both cases gracefully. Patch by @masatake.

b4n · 2018-08-05T15:13:23Z

@masatake Thanks for the review! I made the changes, tell me if it was what you were expecting.

Avoid truncating a pattern in the middle of a UTF-8 character

b4n added the Main part label Jul 30, 2018

b4n requested a review from masatake July 30, 2018 10:20

b4n mentioned this pull request Jul 30, 2018

Universal ctags inserts invalid utf-8 characters for certain files #1805

Closed

b4n added 5 commits July 30, 2018 15:01

Avoid truncating a pattern in the middle of a UTF-8 character

7fb2193

Fixes universal-ctags#1275, fixes universal-ctags#1805.

etags: Fix crash when pattern-length-limit=0

b4d9e03

etags: Fix off-by-one pattern-length-limit truncation

cab7995

`-1` was for stripping the newline, and has no reason to be repeated for truncation length as truncation occurs only if it would result in a strictly shorter string anyway (at most `len - 1` already).

etags: Avoid truncating a pattern in the middle of a UTF-8 character

4f39a65

Add test cases for UTF-8 pattern truncation

c971a39

b4n force-pushed the utf8-pattern-truncation branch from 651ce41 to c971a39 Compare July 30, 2018 13:09

This was referenced Jul 30, 2018

PYTHON-API: vim.eval breaks, if the string a:items contains invalid utf-8 characters vim/vim#3213

Closed

Tag search in ctrlp-py-matcher FelikZ/ctrlp-py-matcher#43

Open

b4n added 2 commits August 5, 2018 16:54

etags: Better handling of line endings

5d73ebe

Value returned by `readLineFromBypassAnyway()` can end with or without a newline character. Thus, make sure to handle both cases gracefully. Patch by @masatake.

Update the docs to mention UTF-8 handling when truncating a pattern

22b51fa

masatake approved these changes Aug 5, 2018

b4n merged commit 22b51fa into universal-ctags:master Aug 6, 2018

b4n added a commit that referenced this pull request Aug 6, 2018

Merge pull request #1807 from b4n/utf8-pattern-truncation

0d56013

Avoid truncating a pattern in the middle of a UTF-8 character
