Better handling of files with unknown character encoding #43

tkarabela · 2021-05-14T20:54:41Z

As of 1.2.0, we default to UTF-8 encoding. If this is not correct, the user has to specify the proper encoding manually. To improve UX, we could try some autodetection before bailing out.

This is already something that users are dealing with, see:

Consider adding https://github.com/chardet/chardet as (optional?) dependency.

(This is another idea from the original pysubs library.)


milahu · 2023-12-17T13:58:28Z

> Consider adding https://github.com/chardet/chardet as (optional?) dependency.

chardet and libmagic guess wrong too often
i have much better experience with charset_normalizer

context: im parsing millions of subtitles from opensubtitles.org
and many old subs have non-utf8 encodings

something like...

diff --git a/pysubs2/ssafile.py b/pysubs2/ssafile.py
index 1202a46..ee22ea9 100644
--- a/pysubs2/ssafile.py
+++ b/pysubs2/ssafile.py
@@ -53,7 +53,7 @@ class SSAFile(MutableSequence):
     # ------------------------------------------------------------------------
 
     @classmethod
-    def load(cls, path: str, encoding: str="utf-8", format_: Optional[str]=None, fps: Optional[float]=None, **kwargs) -> "SSAFile":
+    def load(cls, path: str, encoding: Optional[str]="utf8", format_: Optional[str]=None, fps: Optional[float]=None, **kwargs) -> "SSAFile":
         """
         Load subtitle file from given path.
 
@@ -67,7 +67,8 @@ class SSAFile(MutableSequence):
         Arguments:
             path (str): Path to subtitle file.
             encoding (str): Character encoding of input file.
-                Defaults to UTF-8, you may need to change this.
+                Default is "utf8".
+                Set to None to autodetect the encoding.
             format_ (str): Optional, forces use of specific parser
                 (eg. `"srt"`, `"ass"`). Otherwise, format is detected
                 automatically from file contents. This argument should
@@ -100,6 +101,13 @@ class SSAFile(MutableSequence):
             >>> subs3 = pysubs2.load("subrip-subtitles-with-fancy-tags.srt", keep_unknown_html_tags=True)
 
         """
+        if encoding is None:
+            # guess encoding
+            import charset_normalizer
+            with open(path, "rb") as fp:
+                content_bytes = fp.read()
+            charset_matches = charset_normalizer.from_bytes(content_bytes)
+            encoding = str(charset_matches.best().encoding)
         with open(path, encoding=encoding) as fp:
             return cls.from_file(fp, format_, fps=fps, **kwargs)

edit

-            encoding = charset_matches.encoding
+            encoding = str(charset_matches.best().encoding)

also SSAFile.from_bytes is missing
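
One gap in the patch above is that `charset_matches.best()` returns `None` when charset_normalizer finds no plausible encoding, so `.encoding` would raise `AttributeError`. A minimal sketch of a more defensive guess follows. The helper name `guess_encoding` and the `default` parameter are hypothetical and not part of pysubs2, and the import is made optional only so that the snippet still runs without the third-party package installed.

```python
try:
    import charset_normalizer  # optional third-party dependency
except ImportError:
    charset_normalizer = None


def guess_encoding(data: bytes, default: str = "utf-8") -> str:
    """Return a best-guess codec name for `data` (hypothetical helper).

    Falls back to `default` when the input is empty, when
    charset_normalizer is not installed, or when it finds no match.
    """
    if not data or charset_normalizer is None:
        return default
    best = charset_normalizer.from_bytes(data).best()
    # best() is None when no candidate encoding was considered plausible
    return best.encoding if best is not None else default
```

The returned name could then be passed as `encoding=` to `SSAFile.load`, or used to decode the bytes before handing the text to the parser.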

milahu · 2024-03-10T12:16:07Z

push

this is such an easy fix...

> also SSAFile.from_bytes is missing

related:
for my app, it would also be useful to ignore text encoding, and parse the raw bytes
because one subtitle file can contain multiple text encodings, for example utf8 and latin1
and in that case, no "guess encoding" library will help
see also jawah/charset_normalizer#405
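
To illustrate the mixed-encoding case, here is a rough sketch that decodes a file line by line and tries a list of candidate encodings for each line. It assumes the encoding only changes at line boundaries, which real files may not respect. The function name `decode_mixed` and the candidate list are hypothetical, and this is not how pysubs2 handles the problem.

```python
from typing import Sequence


def decode_mixed(raw: bytes, encodings: Sequence[str] = ("utf-8", "latin-1")) -> str:
    """Decode each line with the first candidate encoding that accepts it."""
    decoded = []
    for line in raw.splitlines(keepends=True):
        for enc in encodings:
            try:
                decoded.append(line.decode(enc))
                break
            except UnicodeDecodeError:
                continue
        else:
            # No candidate accepted the line: keep it, marking the bad bytes
            decoded.append(line.decode("utf-8", errors="replace"))
    return "".join(decoded)


# One UTF-8 line followed by one Latin-1 line in the same byte string
sample = "Grüße\n".encode("utf-8") + "café\n".encode("latin-1")
print(decode_mixed(sample))
```

Because Latin-1 accepts every byte value, it has to come last in the candidate list. A line that happens to be valid UTF-8 is always decoded as UTF-8, so this remains a heuristic.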

tkarabela · 2024-05-05T14:20:31Z

@milahu I see your point, but I also don't like having str and non-str subtitle files... I think the answer is errors="surrogateescape", I will try to implement it for the next version of the library.

This addresses the long-standing ergonomic issue #43 with files that have mixed or unknown character encoding. Previously, the library assumed that both input and output files were UTF-8 and failed when this was incorrect, forcing the user to provide the appropriate character encoding. After this commit, UTF-8 is still the default input/output encoding, but the default error handling changed from "strict" to "surrogateescape", i.e. bytes that are not valid UTF-8 are read as lone Unicode surrogate code points, which are turned back into the original bytes on output. To get the previous behaviour, use `SSAFile.load(..., errors=None)` and `SSAFile.save(..., errors=None)`. For text processing, you should still specify the encoding explicitly, otherwise you will get surrogate code points instead of non-ASCII characters when inspecting the SSAFile. Note that multi-byte encodings may still break the parser; parsing with surrogate escapes works best with ASCII-like encodings.
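
The round trip described in this commit message can be reproduced with the standard library alone. The sketch below uses plain `bytes.decode` and `str.encode` rather than pysubs2 itself, and the helper names `read_lossless` and `write_lossless` are hypothetical.

```python
def read_lossless(raw: bytes) -> str:
    """Decode as UTF-8, mapping each invalid byte to a lone surrogate."""
    return raw.decode("utf-8", errors="surrogateescape")


def write_lossless(text: str) -> bytes:
    """Encode as UTF-8, turning escaped surrogates back into the original bytes."""
    return text.encode("utf-8", errors="surrogateescape")


raw = b"caf\xe9 na\xefve\n"  # Latin-1 bytes, not valid UTF-8

text = read_lossless(raw)
# Each undecodable byte 0xNN becomes the code point U+DCNN
assert text == "caf\udce9 na\udcefve\n"

# Writing with the same error handler restores the input byte for byte
assert write_lossless(text) == raw

# For text processing, the real encoding is still needed
assert raw.decode("latin-1") == "café naïve\n"
```

Encoding the escaped string with the default "strict" handler raises `UnicodeEncodeError`, which is why the error handler has to match on load and on save.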

tkarabela added the enhancement label May 14, 2021

milahu mentioned this issue Mar 10, 2024

use bytestrings #84

Closed

tkarabela changed the title ~~Add character encoding autodetection~~ Better handling of files with unknown character encoding May 5, 2024
