Fix custom encodings in filenames #15

Gemorroj · 2018-10-14T09:32:37Z

It may be possible to detect the encoding with mb_detect_encoding...
Or at least to check whether the filename is valid UTF-8.
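
A minimal sketch of that check, assuming the raw filename bytes are already available as a PHP string. The helper names `isValidUtf8` and `guessEncoding` and the candidate list are hypothetical and not part of Archive7z. Note that `mb_detect_encoding` only offers a guess: single-byte code pages accept almost any byte sequence, so several candidates can be valid at once.

```php
<?php
// Hypothetical helper, not part of Archive7z: true if the bytes are well-formed UTF-8.
function isValidUtf8(string $name): bool
{
    return mb_check_encoding($name, 'UTF-8');
}

// Hypothetical helper: one candidate encoding the bytes are valid in, or null.
// When several candidates are valid, the choice is only a guess.
function guessEncoding(string $name, array $candidates = ['UTF-8', 'CP866', 'Windows-1251']): ?string
{
    $detected = mb_detect_encoding($name, $candidates, true); // strict mode
    return $detected === false ? null : $detected;
}

$utf8Name  = "тест.txt";             // assumes this source file is saved as UTF-8
$cp866Name = "\x92\xA5\xE1\xE2.txt"; // "Тест.txt" encoded as CP866

var_dump(isValidUtf8($utf8Name));  // bool(true)
var_dump(isValidUtf8($cp866Name)); // bool(false): 0x92 cannot start a UTF-8 sequence
var_dump(guessEncoding($cp866Name));
```

The UTF-8 validity check is reliable in one direction only: invalid bytes prove the name is not UTF-8, but a valid result does not prove that UTF-8 was the original encoding.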

Gemorroj · 2018-10-14T10:42:08Z

For zip archivers (and some others), we can use the Host OS attribute (Unix|Win32|FAT).

Gemorroj · 2018-10-14T15:17:07Z

Or change the current system locale:

LANG=en_US.IBM866 && 7z l test.zip

Gemorroj · 2018-10-14T18:08:07Z

https://sourceforge.net/p/p7zip/discussion/383044/thread/3d213124/
But... we don't know the original codepage.

 7z l -mcp=866 test.zip
 7z l -mcp=1252 test.zip

Neither has any effect.

silverqx · 2019-12-18T17:42:28Z

What exactly is the problem? I have 10 failing tests in WSL; are these failing tests related to this issue?

silverqx · 2019-12-18T17:55:27Z

I'm pasting the output of the failed tests; they seem to be related to encoding.
failed_tests.txt

Maikuolan · 2020-07-14T09:38:16Z

I just saw this issue, and it got me thinking: there's a PHP class I wrote that is intended to automatically detect and convert the encoding of data in cases such as these (with moderate success), and I wondered whether it could be helpful here. So I did some brief testing (about 5 minutes), but ran into a problem: the sample filename provided in the SourceForge discussion conforms with several different encodings (at least ~26, assuming I haven't made any mistakes at my end when writing the class in the first place). So I don't think it would be possible, or at least not easy, to conclusively pick one encoding above all the others.

That may mean the class isn't helpful here, but I thought I should share the results of this testing anyway, in case it inspires ideas that eventually lead to a solution.

<?php
// Note: The sample's extension (".doc") intentionally omitted here, for easier processing, simpler testing, etc.
$Sample = 'ÂŽÂ›ÂžÃ¥ÂœÂª Â©Â¬Â£Â§Â¢Ã£Â¨Ã Â©ÂžÂª Â˜Ã¥Â«ÂžÂ©ÂžÂª_Â™';

$Demojibakefier = new \Maikuolan\Common\Demojibakefier();

foreach ($Demojibakefier->supported() as $CharSet) {
    $Conforms = $Demojibakefier->checkConformity($Sample, $CharSet);
    echo ($Conforms ? 'Conforms with ' : 'Does not conform with ') . $CharSet . ".\n";
}

Produces:

Conforms with UTF-8.
Does not conform with UTF-16BE.
Does not conform with UTF-16LE.
Does not conform with ISO-8859-1.
Conforms with CP1252.
Does not conform with ISO-8859-2.
Does not conform with ISO-8859-3.
Does not conform with ISO-8859-4.
Does not conform with ISO-8859-5.
Does not conform with ISO-8859-6.
Does not conform with ISO-8859-7.
Does not conform with ISO-8859-8.
Does not conform with ISO-8859-9.
Does not conform with ISO-8859-10.
Does not conform with ISO-8859-11.
Does not conform with ISO-8859-13.
Does not conform with ISO-8859-14.
Does not conform with ISO-8859-15.
Does not conform with ISO-8859-16.
Does not conform with CP1250.
Does not conform with CP1251.
Does not conform with CP1253.
Does not conform with CP1254.
Does not conform with CP1255.
Conforms with CP1256.
Does not conform with CP1257.
Does not conform with CP1258.
Conforms with GB18030.
Conforms with GB2312.
Conforms with BIG5.
Does not conform with SHIFT-JIS.
Conforms with JOHAB.
Does not conform with UCS-2.
Does not conform with UTF-32BE.
Does not conform with UTF-32LE.
Does not conform with UCS-4.
Conforms with CP437.
Conforms with CP737.
Conforms with CP775.
Conforms with CP850.
Conforms with CP852.
Conforms with CP855.
Conforms with CP857.
Conforms with CP860.
Conforms with CP861.
Conforms with CP862.
Conforms with CP863.
Does not conform with CP864.
Conforms with CP865.
Conforms with CP866.
Conforms with CP869.
Does not conform with CP874.
Conforms with KOI8-RU.
Conforms with KOI8-R.
Conforms with KOI8-U.
Conforms with KOI8-F.
Does not conform with KOI8-T.
Does not conform with CP037.
Does not conform with CP500.
Conforms with CP858.
Does not conform with CP875.
Does not conform with CP1026.

If we wanted to get clever, it might be possible to "guess" which encoding is used by decoding the bytes under each candidate encoding and comparing the resulting characters against frequency tables for the occurrence of specific characters in different languages. But that's a lot of work, is prone to false positives (e.g., if someone just uses random characters or weird filenames), and is also well outside the scope of responsibility for Archive7z, so I would not recommend it.

Probably, in order to solve this effectively, we would need to be able to get the information from somewhere (e.g., the implementation, the OS, etc.).

Maybe just add a public property somewhere to Archive7z, to allow the implementation to specify the preferred encoding, and then work from that (falling back either to UTF-8, or to some kind of "best guess", if the implementation fails to populate the property)? That way, the problem becomes the responsibility of the implementation, and not Archive7z.
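
A rough sketch of what such a property could look like, assuming PHP 7.4+ for typed properties. The class name `FilenameDecoder`, the property `$preferredEncoding` and the method `toUtf8` are all hypothetical and not part of Archive7z's actual API; the fallback shown here simply treats the bytes as UTF-8 already.

```php
<?php
// Hypothetical sketch: names are illustrative, not Archive7z's API.
final class FilenameDecoder
{
    /** Encoding the caller believes the archive's filenames use, or null if unknown. */
    public ?string $preferredEncoding = null;

    /** Convert raw filename bytes to UTF-8. */
    public function toUtf8(string $rawName): string
    {
        if ($this->preferredEncoding !== null) {
            // The implementation told us the encoding, so trust it.
            return mb_convert_encoding($rawName, 'UTF-8', $this->preferredEncoding);
        }

        // Fallback: assume the name is already UTF-8 and return it unchanged.
        return $rawName;
    }
}

$decoder = new FilenameDecoder();
$decoder->preferredEncoding = 'CP866';
echo $decoder->toUtf8("\x92\xA5\xE1\xE2.txt"), "\n"; // prints "Тест.txt"
```

With the property left unset, the same call returns the input bytes untouched, which keeps the default behaviour unchanged for callers that don't care about encodings.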

Gemorroj · 2020-07-19T14:32:04Z

There is a patch for p7zip: https://github.com/unxed/oemcp/blob/master/p7zip_oemcp_ZipItem.cpp.patch

if (!isUtf8 && ((hostOS == NFileHeader::NHostOS::kFAT) || (hostOS == NFileHeader::NHostOS::kNTFS))) {

As far as I can see, the patch relies on the "is UTF-8" information (which is trivial to check) and on the hostOS attribute,
and then uses LC_CTYPE to detect the current locale and codepage.
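
The same decision could be sketched in PHP for a ZIP central directory header, under some assumptions. The function and constant names below are hypothetical; this is not code from Archive7z or p7zip. It relies on two ZIP format facts: bit 11 of the general purpose flags marks UTF-8 names, and the high byte of "version made by" is the host OS (0 is MS-DOS/FAT). Host codes for NTFS differ between tools, so the list of hosts is left as a parameter, and the locale-to-codepage lookup that the patch performs is omitted.

```php
<?php
// Hypothetical sketch of the patch's condition, not code from Archive7z or p7zip.
const ZIP_FLAG_UTF8 = 0x0800; // general purpose bit 11: names are UTF-8
const ZIP_HOST_FAT  = 0;      // high byte of "version made by": MS-DOS/FAT

/**
 * True if the name is not flagged as UTF-8 and was written by a DOS/Windows-like host,
 * i.e. it probably uses an OEM code page. $oemHosts lists the host codes to treat that way.
 */
function needsOemConversion(int $versionMadeBy, int $flags, array $oemHosts = [ZIP_HOST_FAT]): bool
{
    $isUtf8 = ($flags & ZIP_FLAG_UTF8) !== 0;
    $hostOs = ($versionMadeBy >> 8) & 0xFF;

    return !$isUtf8 && in_array($hostOs, $oemHosts, true);
}

// First 10 bytes of a central directory header:
// signature, version made by, version needed, general purpose flags.
$header = pack('Vvvv', 0x02014b50, (ZIP_HOST_FAT << 8) | 20, 20, 0);
$fields = unpack('Vsignature/vmadeBy/vneeded/vflags', $header);

var_dump(needsOemConversion($fields['madeBy'], $fields['flags'])); // bool(true)
var_dump(needsOemConversion((3 << 8) | 20, 0));                    // bool(false): host 3 is Unix
var_dump(needsOemConversion(20, ZIP_FLAG_UTF8));                   // bool(false): UTF-8 flag set
```

When the function returns true, the caller would still need to know which OEM code page to convert from, which is the part the patch derives from LC_CTYPE.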

Gemorroj self-assigned this Oct 14, 2018

Gemorroj added this to the 5.0.0 milestone Oct 14, 2018

Gemorroj added the help wanted label Oct 21, 2018

Gemorroj modified the milestones: 5.0.0, 6.0.0 Dec 31, 2019

Gemorroj mentioned this issue Aug 5, 2022

Question marks in Cyrillic file names #31

Open

Gemorroj mentioned this issue May 14, 2023

Hello, what is the problem causing this error? #34

Closed

Gemorroj removed this from the 6.0.0 milestone Jan 4, 2024
