Mixed Formats (DOS and UTF-8) #1370

Open
tamer73 opened this issue Aug 20, 2023 · 25 comments
Labels
Priority: Low (on the radar, but not the most urgent thing) · Status: Confirmed (verified by someone other than the reporter)

Comments

@tamer73

tamer73 commented Aug 20, 2023

What Happened?

The error message says "No text Found. Maybe corrupt or no text file" when trying to open a 178 kB PHP file with Code version 7.1.0-1 on Manjaro Linux with everything up to date. Every other PHP file works as expected! It looks like Code can't read the file into the buffer, so it can't even recognize that it's a PHP file.

Steps to Reproduce

  1. Opening the 178 kB PHP file inside Code or from the Dolphin browser results in the following error:
    pantheon-code_issue

Expected Behavior

Just wanted to view the php code

OS Version

Other Linux

Software Version

Latest release (I have run all updates)

Log Output

No response

Hardware Info

CPU: dual core Intel Core i3-4130 (-MT MCP-) speed/min/max: 1450/800/3400 MHz
Kernel: 6.1.44-1-MANJARO x86_64 Up: 29m Mem: 3.15/11.6 GiB (27.1%)
Storage: 461.98 GiB (3.9% used) Procs: 194 Shell: Zsh inxi: 3.3.29

@jeremypw
Collaborator

What happens when you press "Show Anyway"? Can you load the file with nano or another simple text editor?

@jeremypw
Collaborator

This error message is shown when the Gtk.SourceFileLoader throws an error while loading the file. I wouldn't have thought file size would be an issue on modern hardware.

@tamer73
Author

tamer73 commented Aug 21, 2023 via email

@tamer73
Author

tamer73 commented Aug 21, 2023 via email

@jeremypw
Collaborator

@tamer73 Thanks for the info. Could you try running Code from the terminal command line (io.elementary.code) and see what output is produced when you try to load the problematic file? You should see a critical error message from the SourceFileLoader with more information. If you could make the problem file available it would help investigate the problem.

@tamer73
Author

tamer73 commented Aug 22, 2023 via email

@tamer73
Author

tamer73 commented Aug 26, 2023 via email

@tamer73
Author

tamer73 commented Aug 27, 2023 via email

@jeremypw
Collaborator

Ah, OK. I wonder why Gtk.SourceFileLoader produces that error but nano does not. I'm not sure whether we need to show that information to the user or should just load the file anyway. As you found, you can use "Show Anyway" to load the file. Can you see which character(s) have been altered? I would have thought that, with the right language pack(s) installed, you should have all the character sets you need.

@tamer73
Author

tamer73 commented Aug 27, 2023 via email

@jeremypw
Collaborator

Is the original file encoded as UTF-8 or something else?

It may be possible to fix this by setting candidate encodings in the loader so that more than one encoding is tried. If you are able to produce a non-sensitive file that still gives the error that would help develop a fix.
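The idea of trying more than one encoding can be sketched outside of GtkSource. The Python below is only an illustration of the fallback order and is not the Gtk.SourceFileLoader API; the function name `decode_with_candidates`, the candidate list, and the sample word are assumptions made for this example.

```python
from typing import Iterable, Tuple

# Hypothetical candidate list: strict UTF-8 first, then a legacy Western encoding.
CANDIDATES = ("utf-8", "iso-8859-1")


def decode_with_candidates(
    data: bytes, candidates: Iterable[str] = CANDIDATES
) -> Tuple[str, str]:
    """Return (text, encoding) for the first candidate that decodes without error."""
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # this candidate failed, try the next one
    raise ValueError("none of the candidate encodings could decode the data")


if __name__ == "__main__":
    legacy = "Grüße".encode("iso-8859-1")  # bytes as a legacy editor might save them
    modern = "Grüße".encode("utf-8")
    print(decode_with_candidates(legacy))
    print(decode_with_candidates(modern))
```

The order matters in this sketch: ISO-8859-1 assigns a character to every byte value, so it never fails and has to come after the stricter UTF-8 attempt.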

@tamer73
Author

tamer73 commented Aug 27, 2023 via email

@tamer73
Author

tamer73 commented Aug 28, 2023 via email

@jeremypw
Collaborator

@tamer73 I think you need to post the test file to e.g. https://pastebin.com/ or maybe use the "Attach files by dragging & dropping, selecting or pasting them" function at the bottom of the GitHub comment box (although I've only ever used that for pictures). Or you could send it as an email attachment to jeremy@elementaryos.org or jeremywootten@gmail.com

@tamer73
Author

tamer73 commented Aug 28, 2023 via email

@tamer73
Author

tamer73 commented Aug 28, 2023 via email

@jeremypw
Collaborator

Thanks for your efforts in narrowing down the cause ❤️ - I'll try to get a fix out soon.

@tamer73
Author

tamer73 commented Aug 29, 2023 via email

@jeremypw
Collaborator

jeremypw commented Aug 29, 2023

So it seems that your file is encoded in "DOS format" (according to nano) and the culprit line is converted by nano to

//######################################## �berpr�fen ########################################

and by Code to

//######################################## \FCberpr\FCfen ########################################

Two characters have been replaced by "unknown character" characters.
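The two renderings can be reproduced approximately in Python. This is a sketch that assumes the two affected bytes are 0xFC, as the `\FC` in Code's output suggests; it uses Python's own decoder error handlers and does not show how nano or Code work internally.

```python
# The affected word as raw bytes, assuming each problem character is the single byte 0xFC.
raw = b"\xfcberpr\xfcfen"

# Substitute U+FFFD for each undecodable byte, similar to what nano displayed.
as_replacement = raw.decode("utf-8", errors="replace")

# Escape each undecodable byte instead, similar in spirit to the \FC shown by Code.
as_escape = raw.decode("utf-8", errors="backslashreplace")

print(as_replacement)
print(as_escape)
```

In both cases the original character is not recovered; the decoder only marks where a byte could not be interpreted as UTF-8.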

If you use "Save As" in Code to save the file (immediately after using "Show Anyway") with either the same or different name, close the original tab and then open the saved file it loads correctly and is recognized as PHP. This is actually intended behaviour for dealing with what Code thinks are potentially corrupted/non-text files - it stops you trying to edit them and potentially make things worse.

However, in this case the original file was misidentified as problematic due to DOS encoding which, it appears, the Gtk.SourceFileLoader does not handle properly by default. I'll see if there is a way round this.

@tamer73
Author

tamer73 commented Aug 29, 2023 via email

@jeremypw
Collaborator

I can convert your file so that Code shows the expected characters (I presume) using the command:

iconv -f ISO-8859-1 -t UTF-8//TRANSLIT edit-test2.php -o iconv-utf.php

I sent the output to a separate file to avoid overwriting the original. Opening the converted file shows:

//######################################## überprüfen ########################################
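For anyone without iconv to hand, roughly the same conversion can be sketched in Python. The function name `convert_to_utf8` and the file names in the demo are hypothetical, and the source encoding is assumed to be ISO-8859-1 as in the command above.

```python
import tempfile
from pathlib import Path


def convert_to_utf8(src: Path, dst: Path, source_encoding: str = "iso-8859-1") -> None:
    """Re-encode src as UTF-8 into dst, leaving src untouched."""
    data = src.read_bytes()              # read raw bytes so line endings are preserved
    text = data.decode(source_encoding)  # interpret the bytes in the assumed encoding
    dst.write_bytes(text.encode("utf-8"))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        original = Path(tmp) / "original.php"    # hypothetical file names
        converted = Path(tmp) / "converted.php"
        original.write_bytes(b"// \xfcberpr\xfcfen\r\n")
        convert_to_utf8(original, converted)
        print(converted.read_text(encoding="utf-8"))
```

Like the iconv command, this writes to a separate file rather than overwriting the original. Reading and writing bytes keeps any CRLF line endings exactly as they were.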

An octal dump of the original shows that the problem characters are encoded as the single byte hexadecimal FC, which is "ü" in ISO-8859-1 but is not valid UTF-8, so that may be why Code chokes on it without the pre-conversion.
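As an alternative to reading a dump by eye, the offending bytes can be located programmatically. The following is a rough Python sketch; the helper name `find_invalid_utf8` is made up for this example, and it simply reports every position at which Python's strict UTF-8 decoder gives up.

```python
from typing import List, Tuple


def find_invalid_utf8(data: bytes) -> List[Tuple[int, int]]:
    """Return (offset, byte value) for each position where UTF-8 decoding fails."""
    bad = []
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode("utf-8")
            break  # the remainder decodes cleanly
        except UnicodeDecodeError as err:
            offset = pos + err.start       # err.start is relative to the slice
            bad.append((offset, data[offset]))
            pos = offset + 1               # resume just after the bad byte
    return bad


if __name__ == "__main__":
    line = b"// \xfcberpr\xfcfen"
    for offset, value in find_invalid_utf8(line):
        print(f"offset {offset}: 0x{value:02X}")
```

Re-slicing after each failure is simple rather than fast, which should be acceptable for a file of a few hundred kilobytes.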

I presume the file comes from a Windows system? Would you be wanting to return it to Windows after editing on Linux?

@jeremypw
Collaborator

jeremypw commented Aug 29, 2023

Looking into this, it is surprisingly complicated to fix. I can get Code to load the file with the Windows character set by forcing the loader to use that charset/encoding, but then "normal" Linux files have some characters misinterpreted. There does not seem to be any guaranteed way to get the encoding and charset automatically from the file before actually loading it, and it seems the Gtk.SourceFileLoader only detects the encoding, not the character set, while actually loading it.
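The misinterpretation described here is easy to reproduce in isolation. The sketch below assumes the forced character set is ISO-8859-1 (the one named elsewhere in this thread) and uses plain Python decoding rather than the GtkSource loader.

```python
# A "normal" Linux file stores the word as UTF-8, where "ü" takes two bytes (0xC3 0xBC).
utf8_bytes = "überprüfen".encode("utf-8")

# Forcing ISO-8859-1 never raises an error, but each "ü" turns into two wrong characters.
misread = utf8_bytes.decode("iso-8859-1")
print(misread)

# Every possible byte value is a valid ISO-8859-1 character, so a successful decode
# says nothing about whether ISO-8859-1 was the right choice.
all_bytes = bytes(range(256))
print(len(all_bytes.decode("iso-8859-1")))
```

Because the forced decode always succeeds, the loader gets no error signal to fall back on, which is consistent with the difficulty of detecting the right character set automatically.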

I see NotepadQQ has gone to a lot of trouble to handle a wide variety of encodings/charsets and allows the user to choose and convert between them so it is clearly possible.

However, Code is primarily targeted at developing software on Linux and there are limited resources for its development. As this is an edge case, it may not be fixed soon. Probably the best we could do is to offer the user a choice of character sets to try on the file if it fails to load, although this assumes the user knows which character set to choose.
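One way to picture the "offer a choice" idea is a preview of the same bytes under several character sets, from which the user picks the one that reads correctly. This Python sketch is purely illustrative; `preview_encodings` and the list of encodings are assumptions for the example, not anything that exists in Code.

```python
from typing import Dict, Optional, Sequence


def preview_encodings(sample: bytes, encodings: Sequence[str]) -> Dict[str, Optional[str]]:
    """Map each encoding name to the decoded sample, or None if decoding fails."""
    previews: Dict[str, Optional[str]] = {}
    for name in encodings:
        try:
            previews[name] = sample.decode(name)
        except UnicodeDecodeError:
            previews[name] = None  # this encoding cannot represent the bytes
    return previews


if __name__ == "__main__":
    sample = b"\xfcberpr\xfcfen"
    for name, text in preview_encodings(sample, ["utf-8", "iso-8859-1", "cp437"]).items():
        print(f"{name:12} {text if text is not None else '(cannot decode)'}")
```

Several encodings may decode the sample without error yet produce different text, so the final decision still rests with someone who knows what the file should say.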

The best way forward for you is probably to convert the file out of an old unsupported format and into a modern one that both Linux and Windows support out of the box.

@tamer73
Author

tamer73 commented Aug 29, 2023

Thanks for taking a look at it. I'm OK with it and can easily work around it. I just wanted to inform you, and at least we could shed a little more light on this behaviour.

@tamer73
Author

tamer73 commented Aug 29, 2023

The source of this file is close to twenty years old. It needs to be reprogrammed anyway, so no worries. It was created by my boss in a way no one would do today. So everything's fine and there's no pressure from any side :-) Should I close this issue?

@jeremypw
Collaborator

Well I'll leave it open with a revised description as it is a valid issue, but it will probably have a low priority for fixing unless another dev can see an easy fix.

@jeremypw changed the title from "Max file size issue?" to "Text file with Windows (ISO-8859-1) character set regarded as corrupt or not text" on Aug 29, 2023
@jeremypw added the "Priority: Low" and "Status: Confirmed" labels on Aug 29, 2023
@tamer73 changed the title from "Text file with Windows (ISO-8859-1) character set regarded as corrupt or not text" to "Mixed Formats (DOS and UTF-8)" on Sep 3, 2023