ted crashs/hangs on binary files [bug] #40

bynect · 2021-03-30T16:37:31Z

I erroneously opened a binary file in ted and it behaved strangely (misplaced cursor and line numbers), and after a certain line (300 or something) just hanged. I have yet to reproduce this behavior on multiple files, but I will post updates on this bug soon.

bynect · 2021-03-30T16:42:53Z

Maybe its a problem with the utf8 encoding/decoding mechanism

arthurbacci · 2021-03-30T16:43:36Z

Maybe its a problem with the utf8 encoding/decoding mechanism

Try opening without utf8

arthurbacci · 2021-03-30T16:44:02Z

Sorry, closed for accident.

bynect · 2021-03-30T16:44:49Z

Maybe its a problem with the utf8 encoding/decoding mechanism

Try opening without utf8

Is there a way to disable utf8?

arthurbacci · 2021-03-30T16:48:44Z

That's were the UTF8 is read

arthurbacci · 2021-03-30T16:49:33Z

Just comment this part of the code

bynect · 2021-03-30T16:49:57Z

Ok I'll try this asap

bynect · 2021-03-30T16:58:46Z

Ok I tried it and seems like that when commenting that part of the code, lines are no longer misplaced, but now ted hangs on line ~450. ~~But maybe this problem is more related to memory (?)~~
Now I tried opening the ted binary itself and it runs completely fine (I scrolled down to line 1500), so maybe its not a memory problem after all.

bynect · 2021-03-30T17:06:46Z

Maybe the problem lies within a call to utf8ToMultibyte from show_lines, but I am unsure on how to handle that

bynect · 2021-03-30T17:27:00Z

I have retried and ted reads successfully till the end of the binary file, however it still misplaces the cursor sometimes and up to a certain point becomes so clunky I can't even tell if its crashed or just slow, so this seems like a bug to me

arthurbacci · 2021-03-30T19:59:26Z

I think we need a system to ignore characters that don't represent anything in text.

bynect · 2021-03-30T20:03:55Z

I think we need a system to ignore characters that don't represent anything in text.

By replacing with something like this (https://www.compart.com/en/unicode/U+16844) ? I think its a very good idea, and even production grade editors do that.

bynect · 2021-03-30T20:04:55Z

The worst problem is that in case of a malformed unicode sequence we could read arrays out of bound or end up in other UB situations

bynect · 2021-03-30T20:06:26Z

Also I have successfully read over 5000 lines with ted, so this theory about memory is officially disproved.

Ok I tried it and seems like that when commenting that part of the code, lines are no longer misplaced, but now ted hangs on line ~450. ~~But maybe this problem is more related to memory (?)~~
Now I tried opening the ted binary itself and it runs completely fine (I scrolled down to line 1500), so maybe its not a memory problem after all.

arthurbacci · 2021-03-30T20:13:36Z

Also I have successfully read over 5000 lines with ted, so this theory about memory is officially disproved.

Ok I tried it and seems like that when commenting that part of the code, lines are no longer misplaced, but now ted hangs on line ~450. ~~But maybe this problem is more related to memory (?)~~
Now I tried opening the ted binary itself and it runs completely fine (I scrolled down to line 1500), so maybe its not a memory problem after all.

In theory, ted can read ~4gb

bynect · 2021-03-30T21:04:42Z

The worst problem is that in case of a malformed unicode sequence we could read arrays out of bound or end up in other UB situations

I'll look into this, quoting from rfc 3629

Implementations of the decoding algorithm above MUST protect against
decoding invalid sequences. For instance, a naive implementation may
decode the overlong UTF-8 sequence C0 80 into the character U+0000,
or the surrogate pair ED A1 8C ED BE B4 into U+233B4. Decoding
invalid sequences may have security consequences or cause other
problems. See Security Considerations (Section 10) below.

bynect · 2021-03-30T22:04:38Z

This is an example of the line numbers/cursor misplacement

bynect · 2021-03-30T22:07:28Z

I have tried to add the utf8 validation but I am unsure if it is actually working well, I would appreciate if you can check here https://github.com/bynect/Teditor/blob/utf8try/src/utf8.c (I have made only small changes)

I tried to replace all invalid codepoints with this one

arthurbacci · 2021-03-30T22:11:20Z

This is an example of the line numbers/cursor misplacement

I think it can be a problem with the line break type detection. Maybe it detects it as CRLF, but a loose LF is not interpreted as a line?

bynect · 2021-03-30T22:19:06Z

This is an example of the line numbers/cursor misplacement

I think it can be a problem with the line break type detection. Maybe it detects it as CRLF, but a loose LF is not interpreted as a line?

I cut it out from the screenshot, but the file was detected as lf, and to my knowledge all files, binary or not, are lf terminated on Linux

arthurbacci · 2021-03-30T22:24:11Z

This is an example of the line numbers/cursor misplacement

I think it can be a problem with the line break type detection. Maybe it detects it as CRLF, but a loose LF is not interpreted as a line?

I cut it out from the screenshot, but the file was detected as lf, and to my knowledge all files, binary or not, are lf terminated on Linux

If you download a file that was made in Windows, it probally will be CRLF.

bynect · 2021-03-30T22:26:36Z

This is an example of the line numbers/cursor misplacement

I think it can be a problem with the line break type detection. Maybe it detects it as CRLF, but a loose LF is not interpreted as a line?

I cut it out from the screenshot, but the file was detected as lf, and to my knowledge all files, binary or not, are lf terminated on Linux

If you download a file that was made in Windows, it probally will be CRLF.

Well yes, but in this case the file was produced on Linux, specifically by gcc from a little test c code

arthurbacci · 2021-03-30T22:30:49Z

I will take a look at it later, now I am changing other things in the code.

bynect · 2021-03-31T12:30:54Z

Regarding this I think I have just found the root cause of misplaced lines et al, which is the way read_lines read linebreaks. Ive also found out that binary files are CRLF terminated even on linux, and this problem is related to that

arthurbacci · 2021-03-31T13:33:57Z

Probably the problem is that ted is trying to find the type of the line break, while it does not have a consistent line break type.

bynect · 2021-03-31T13:42:02Z

Probably the problem is that ted is trying to find the type of the line break, while it does not have a consistent line break type.

Exactly, furthermore I found that read_lines discards a character after having found a carriage return without checking for the linefeed in CRLF mode. Also I was trying to implement an heuristics that would guess in an acceptable manner if the file is valid utf8/non binary.

arthurbacci · 2021-03-31T14:09:49Z

If we remove CR line break support, we can just ignore carriage returns when reading the file.

arthurbacci · 2021-03-31T14:10:13Z

CR is not used in any modern operating system. There is no why to support it.

arthurbacci · 2021-08-04T22:16:57Z

If we remove CR line break support, we can just ignore carriage returns when reading the file.

48523c4

arthurbacci closed this as completed Mar 30, 2021

arthurbacci reopened this Mar 30, 2021

arthurbacci pinned this issue Mar 30, 2021

arthurbacci added the bug Something isn't working label Mar 30, 2021

bynect mentioned this issue Apr 1, 2021

(Try to) fix Utf8 #54

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ted crashs/hangs on binary files [bug] #40

ted crashs/hangs on binary files [bug] #40

bynect commented Mar 30, 2021 •

edited

bynect commented Mar 30, 2021

arthurbacci commented Mar 30, 2021

arthurbacci commented Mar 30, 2021

bynect commented Mar 30, 2021

arthurbacci commented Mar 30, 2021

arthurbacci commented Mar 30, 2021

bynect commented Mar 30, 2021

bynect commented Mar 30, 2021 •

edited

bynect commented Mar 30, 2021 •

edited

bynect commented Mar 30, 2021 •

edited

arthurbacci commented Mar 30, 2021

bynect commented Mar 30, 2021

bynect commented Mar 30, 2021

bynect commented Mar 30, 2021

arthurbacci commented Mar 30, 2021

bynect commented Mar 30, 2021 •

edited

bynect commented Mar 30, 2021

bynect commented Mar 30, 2021 •

edited

arthurbacci commented Mar 30, 2021

bynect commented Mar 30, 2021 •

edited

arthurbacci commented Mar 30, 2021

bynect commented Mar 30, 2021

arthurbacci commented Mar 30, 2021

bynect commented Mar 31, 2021

arthurbacci commented Mar 31, 2021

bynect commented Mar 31, 2021

arthurbacci commented Mar 31, 2021

arthurbacci commented Mar 31, 2021 •

edited

arthurbacci commented Aug 4, 2021

ted crashs/hangs on binary files [bug] #40

ted crashs/hangs on binary files [bug] #40

Comments

bynect commented Mar 30, 2021 • edited

bynect commented Mar 30, 2021

arthurbacci commented Mar 30, 2021

arthurbacci commented Mar 30, 2021

bynect commented Mar 30, 2021

arthurbacci commented Mar 30, 2021

arthurbacci commented Mar 30, 2021

bynect commented Mar 30, 2021

bynect commented Mar 30, 2021 • edited

bynect commented Mar 30, 2021 • edited

bynect commented Mar 30, 2021 • edited

arthurbacci commented Mar 30, 2021

bynect commented Mar 30, 2021

bynect commented Mar 30, 2021

bynect commented Mar 30, 2021

arthurbacci commented Mar 30, 2021

bynect commented Mar 30, 2021 • edited

bynect commented Mar 30, 2021

bynect commented Mar 30, 2021 • edited

arthurbacci commented Mar 30, 2021

bynect commented Mar 30, 2021 • edited

arthurbacci commented Mar 30, 2021

bynect commented Mar 30, 2021

arthurbacci commented Mar 30, 2021

bynect commented Mar 31, 2021

arthurbacci commented Mar 31, 2021

bynect commented Mar 31, 2021

arthurbacci commented Mar 31, 2021

arthurbacci commented Mar 31, 2021 • edited

arthurbacci commented Aug 4, 2021

bynect commented Mar 30, 2021 •

edited

bynect commented Mar 30, 2021 •

edited

bynect commented Mar 30, 2021 •

edited

bynect commented Mar 30, 2021 •

edited

bynect commented Mar 30, 2021 •

edited

bynect commented Mar 30, 2021 •

edited

bynect commented Mar 30, 2021 •

edited

arthurbacci commented Mar 31, 2021 •

edited