file encoding on Linux / Windows #7

Open
odavy37 opened this issue Sep 30, 2019 · 2 comments

odavy37 commented Sep 30, 2019

Dear Sir,

I am using pyFileFixity, which is a very efficient tool.

But the output files for checksum / header / ECC seem to be OS-dependent (charset-dependent), which can lead to errors.

For example, I computed checksums on Windows 10, then I tested the checksums on Linux (Debian buster, utf-8 encoding), and I found multiple errors.
It appears that the output file was written on Windows in CP-1215, while it was read as utf-8 on Linux.
This was easily fixed for the checksums output file by converting the file to utf-8.
But I am wondering whether header and ECC output files computed on Windows will work if I repair the file on Linux?
Is there an option in pyFileFixity to set the charset used for reading / writing files (utf-8)?
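
The conversion to utf-8 mentioned above can be sketched as follows. This is only a minimal illustration, not part of pyFileFixity: the function name `reencode_to_utf8` is hypothetical, and the default source encoding `cp1252` is an assumption (a common Western European Windows code page) that should be replaced by whatever code page the file was really written in.

```python
def reencode_to_utf8(src, dst, source_encoding="cp1252"):
    """Read `src` decoded as `source_encoding` and write it to `dst` as UTF-8.

    `source_encoding` is an assumption: pass the code page that was
    actually used when the file was written on Windows.
    """
    # newline="" disables newline translation, so the original line
    # endings are preserved exactly on both read and write.
    with open(src, "r", encoding=source_encoding, newline="") as fin:
        text = fin.read()
    with open(dst, "w", encoding="utf-8", newline="") as fout:
        fout.write(text)
```

Writing to a separate destination file keeps the original checksum file intact in case the guessed source encoding turns out to be wrong.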

Best regards,

Olivier

@lrq3000 lrq3000 self-assigned this Apr 7, 2023
@lrq3000 lrq3000 added the bug label Apr 7, 2023

lrq3000 commented Apr 8, 2023

This is not normal: the encoding should be platform-independent. I must have missed specifying an encoding in a few places. My bad, I am very sorry for this oversight.

It will take me some time to find the causes and fix them because the codebase is huge, but it will be done eventually.

Note to self: in headers, also add the Python version, OS and CPU architecture, to future-proof even more (alongside the pyFileFixity version).
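
A rough sketch of how such metadata could be collected with the standard library. The function name `environment_metadata`, the dictionary keys and the `tool_version` parameter are all hypothetical; how pyFileFixity would actually serialize this into its headers is not specified here.

```python
import platform


def environment_metadata(tool_version):
    """Collect environment details that could be stored in a file header.

    `tool_version` is supplied by the caller (hypothetical parameter);
    everything else is queried from the running interpreter.
    """
    return {
        "tool_version": tool_version,
        "python_version": platform.python_version(),  # e.g. "3.11.4"
        "os": platform.system(),                      # e.g. "Linux", "Windows"
        "os_release": platform.release(),
        "cpu_architecture": platform.machine(),       # e.g. "x86_64", "AMD64"
    }
```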


lrq3000 commented Apr 10, 2023

> But I am wondering if header and ECC output files - computed on Windows, will work if I repair the file on Linux?
> Is there an option in pyfileFixity to set charset of reading / output files (utf-8)?

Checksum files (generated by rfigc.py, I presume) are very different from ECC files: checksum files are stored as CSV, whereas ECC files are binary. So no, there should be no encoding issue at all with ECC files, and I was particularly careful about the ECC implementation. ECC files are handled at a low level, so their formatting should be consistent across platforms. They even store their own filepath metadata to make file scraping possible in the future, and the formatting of the path string was made consistent across platforms (and it was hella hard!).

For the CSV files produced by rfigc.py, I was also careful, but apparently not careful enough. With Python 2, CSV files were managed as binary data streams, so it was all fine, but starting with Python 3 they are treated as text streams, so an encoding now comes into play. Furthermore, line endings are inconsistent, and this inconsistency seems to come from the CSV writer itself, which sometimes just ignores all newline and lineterminator settings.
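
For reference, the usual Python 3 pattern for pinning both the encoding and the line endings of a CSV file looks roughly like this. It is a sketch under assumptions, not rfigc.py's actual code: the helper names `write_rows` and `read_rows` are hypothetical, and the choice of `"\n"` as line terminator is arbitrary (any fixed value works, as long as it is the same on every platform).

```python
import csv


def write_rows(path, rows):
    """Write rows as UTF-8 CSV with a fixed line terminator."""
    # encoding="utf-8" avoids falling back to the platform's locale encoding.
    # newline="" stops the text layer from translating "\n" to os.linesep,
    # so the csv writer's lineterminator is written out unchanged.
    with open(path, "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f, lineterminator="\n")
        writer.writerows(rows)


def read_rows(path):
    """Read a UTF-8 CSV file back into a list of rows."""
    with open(path, "r", encoding="utf-8", newline="") as f:
        return list(csv.reader(f))
```

Without an explicit `encoding`, `open()` in text mode uses a locale-dependent default, which is one plausible way a file written on Windows ends up unreadable as utf-8 on Linux.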

So I'm going to try to fix the issues you mention as best I can, but I cannot guarantee that rfigc.py can produce the exact same CSV files across platforms, because by design (RFC 4180), CSV files are made to produce different outputs per platform.
