Better handling of files with unknown character encoding #43

Open
tkarabela opened this issue May 14, 2021 · 3 comments

@tkarabela
Owner

As of 1.2.0, we default to UTF-8 encoding. If this is not correct, the user has to specify the proper encoding manually. To improve UX, we could try some autodetection before bailing out.

This is already something that users are dealing with, see:

Consider adding https://github.com/chardet/chardet as (optional?) dependency.

(This is another idea from the original pysubs library.)
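A minimal sketch of what "try some autodetection before bailing out" with an optional dependency could look like. The helper name `guess_encoding` is hypothetical and not part of pysubs2; it assumes a strict UTF-8 decode is tried first and chardet is consulted only when it is installed.

```python
from typing import Optional


def guess_encoding(data: bytes, default: str = "utf-8") -> Optional[str]:
    """Return a codec name for `data`, or None if no guess is available.

    Hypothetical helper, not part of pysubs2.
    """
    # Cheap path first: if the bytes are valid in the default encoding, keep it.
    try:
        data.decode(default)
        return default
    except UnicodeDecodeError:
        pass

    # Optional dependency: only consulted when the default does not fit.
    try:
        import chardet
    except ImportError:
        return None  # no detector installed, the caller has to bail out

    # chardet.detect() returns a dict; its "encoding" entry may be None.
    return chardet.detect(data)["encoding"]
```

A caller would raise the usual decoding error when this returns None, so behaviour without chardet installed stays the same as today.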

@milahu

milahu commented Dec 17, 2023

Consider adding https://github.com/chardet/chardet as (optional?) dependency.

chardet and libmagic guess wrong too often
I have had much better experience with charset_normalizer

context: I'm parsing millions of subtitles from opensubtitles.org
and many old subs have non-UTF-8 encodings

something like...

diff --git a/pysubs2/ssafile.py b/pysubs2/ssafile.py
index 1202a46..ee22ea9 100644
--- a/pysubs2/ssafile.py
+++ b/pysubs2/ssafile.py
@@ -53,7 +53,7 @@ class SSAFile(MutableSequence):
     # ------------------------------------------------------------------------
 
     @classmethod
-    def load(cls, path: str, encoding: str="utf-8", format_: Optional[str]=None, fps: Optional[float]=None, **kwargs) -> "SSAFile":
+    def load(cls, path: str, encoding: Optional[str]="utf8", format_: Optional[str]=None, fps: Optional[float]=None, **kwargs) -> "SSAFile":
         """
         Load subtitle file from given path.
 
@@ -67,7 +67,8 @@ class SSAFile(MutableSequence):
         Arguments:
             path (str): Path to subtitle file.
             encoding (str): Character encoding of input file.
-                Defaults to UTF-8, you may need to change this.
+                Default is "utf8".
+                Set to None to autodetect the encoding.
             format_ (str): Optional, forces use of specific parser
                 (eg. `"srt"`, `"ass"`). Otherwise, format is detected
                 automatically from file contents. This argument should
@@ -100,6 +101,13 @@ class SSAFile(MutableSequence):
             >>> subs3 = pysubs2.load("subrip-subtitles-with-fancy-tags.srt", keep_unknown_html_tags=True)
 
         """
+        if encoding is None:
+            # guess encoding
+            import charset_normalizer
+            with open(path, "rb") as fp:
+                content_bytes = fp.read()
+            charset_matches = charset_normalizer.from_bytes(content_bytes)
+            encoding = str(charset_matches.best().encoding)
         with open(path, encoding=encoding) as fp:
             return cls.from_file(fp, format_, fps=fps, **kwargs)
 

edit

-            encoding = charset_matches.encoding
+            encoding = str(charset_matches.best().encoding)

also SSAFile.from_bytes is missing
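For illustration, the bytes-to-text step that a `from_bytes` entry point would need can be sketched on its own. The function name `decode_subtitle_bytes` is hypothetical; the sketch assumes charset_normalizer is installed when no encoding is given, and it guards against `best()` returning `None`, which the diff above does not.

```python
from typing import Optional


def decode_subtitle_bytes(data: bytes, encoding: Optional[str] = None) -> str:
    """Decode subtitle bytes, guessing the encoding when none is given.

    Hypothetical helper, not part of pysubs2.
    """
    if encoding is None:
        # Third-party import, only needed on the guessing path.
        import charset_normalizer

        best = charset_normalizer.from_bytes(data).best()
        if best is None:
            # No plausible match: fail loudly instead of hitting
            # an AttributeError on `None.encoding`.
            raise ValueError("could not guess the character encoding")
        encoding = best.encoding
    return data.decode(encoding)
```

The resulting string can then be handed to the existing `SSAFile.from_string`.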

@milahu

milahu commented Mar 10, 2024

push

this is such an easy fix...

also SSAFile.from_bytes is missing

related:
for my app, it would also be useful to ignore text encoding, and parse the raw bytes
because one subtitle file can contain multiple text encodings, for example utf8 and latin1
and in that case, no "guess encoding" library will help
see also jawah/charset_normalizer#405
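One way to handle such files without a detection library, as a rough sketch: decode line by line, trying UTF-8 first and falling back to Latin-1. This assumes the encodings are mixed at line granularity, which is an assumption and not something stated above; `decode_mixed_lines` is a hypothetical name.

```python
def decode_mixed_lines(data: bytes, primary: str = "utf-8", fallback: str = "latin-1") -> str:
    """Decode each line with `primary`, falling back to `fallback` on error."""
    decoded = []
    # bytes.splitlines() splits on \n, \r and \r\n; keepends preserves them.
    for raw_line in data.splitlines(keepends=True):
        try:
            decoded.append(raw_line.decode(primary))
        except UnicodeDecodeError:
            # Latin-1 maps every byte to a code point, so this cannot fail.
            decoded.append(raw_line.decode(fallback))
    return "".join(decoded)
```

A Latin-1 line that happens to be valid UTF-8 would be decoded as UTF-8, so this is a heuristic and not a guarantee.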

@milahu milahu mentioned this issue Mar 10, 2024
@tkarabela tkarabela changed the title from "Add character encoding autodetection" to "Better handling of files with unknown character encoding" May 5, 2024
@tkarabela
Owner Author

@milahu I see your point, but I also don't like having str and non-str subtitle files... I think the answer is errors="surrogateescape", I will try to implement it for the next version of the library.

tkarabela added a commit that referenced this issue May 5, 2024
This addresses long-standing ergonomic issue #43 when dealing with
files that have various or unknown character encoding. Previously,
the library assumed both input and output files should be UTF-8,
and it failed in case this was incorrect, forcing the user to provide
appropriate character encoding.

After this commit, UTF-8 is still the default input/output encoding,
but default error handling changed from "strict" to "surrogateescape",
ie. bytes that are not valid UTF-8 will be read as lone Unicode surrogate
code points, which will be turned back into the original bytes on output.

To get the previous behaviour, use `SSAFile.load(..., errors=None)` and
`SSAFile.save(..., errors=None)`.

For text processing, you should still specify the encoding explicitly,
otherwise you will get surrogate code points instead of non-ASCII characters
when inspecting the SSAFile.

Note that multi-byte encodings may still break the parser; parsing with
surrogate escapes will work best with ASCII-like encodings.
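
The round trip that "surrogateescape" provides can be seen with the standard library alone. This is a small illustration using plain `bytes.decode` / `str.encode`, not pysubs2 itself; the sample bytes are made up.

```python
raw = b"caf\xe9 au lait\n"  # Latin-1 bytes; 0xE9 is not valid UTF-8 here

# Reading: the undecodable byte 0xE9 becomes the lone surrogate U+DCE9.
text = raw.decode("utf-8", errors="surrogateescape")

# Writing: the surrogate is turned back into the original byte.
restored = text.encode("utf-8", errors="surrogateescape")

# For text processing, decode with the real encoding instead.
properly_decoded = raw.decode("latin-1")
```

Here `text` contains `"\udce9"` where the "é" should be, which is why an explicit encoding is still needed when the subtitle text itself is inspected or modified.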