use bytestrings #84

milahu · 2024-03-10T14:38:54Z

simply ignore text encoding, and use raw bytestrings
dealing with text encoding is deferred to the user

this allows handling "broken" files with multiple encodings

probably this is too much change, so i merged this into pysubs2bytes

import pysubs2

subfile_path = "test.srt"

ssa_event_list = pysubs2.load(subfile_path)

for ssa_event in ssa_event_list:
    text = ssa_event.text
    if type(text) == bytes:
        text = text.decode("utf8")
    print("text", repr(text))

tkarabela · 2024-05-05T14:26:06Z

I think this is too user-hostile for the use-cases where you actually want to modify the subtitle texts. I would like to get the same effect with Unicode surrogate escapes, which should improve the situation ie. prevent some hard errors when reading subtitle files that happen today, without burdening everyone with bytes all over the place.

milahu · 2024-05-05T15:20:17Z

mixed-encoding files are a rare problem
for example caused by inserting utf8 advertisments into non-utf8 subtitles

handling mixed-encoding files requires to detect boundaries between different encodings
and for subtitles, these boundaries are the boundaries between text blocks

returning bytes instead of str could be a fallback
when the file encoding was not autodetected (#43)
or when the file has multiple encodings

but pysubs could still return strings
with a more fine-grained encoding detection
(guess encoding per textblock, not per file)

strings would be more user-friendly but less performant
for my app, im ignoring text-encoding completely
with the tradeoff of more complexity in other parts of my code (bytes regex)

Unicode surrogate escapes

that just ignores the problem of mixed-encoding files

the output will also be a mixed-encoding file
but maybe the user wants a single-encoding file

>>> "ö".encode("latin1")
b'\xf6'

>>> "ö".encode("latin1").decode("utf8", errors="surrogateescape")
'\udcf6'

>>> "ö".encode("latin1").decode("utf8", errors="surrogateescape").encode("utf8", errors="surrogateescape")
b'\xf6'

>>> print("ö".encode("latin1").decode("utf8", errors="surrogateescape"))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcf6' in position 0: surrogates not allowed

tkarabela · 2024-05-08T22:02:12Z

I hope that mixed-encoding files are rare, since they sound like hell to deal with :) I'd like if it was possible to use the library to parse them, at least in principle, but I'm not sure how much support the library itself should provide. I think being able to get the raw bytes of the individual subtitle texts is enough - this is now possible with the latest version of my code, the user can .encode("utf8", errors="surrogateescape") the individual subtitles and then re-interpret them with a different encoding.

The performance aspect is something I haven't really thought about, and isn't my goal for the library. Your bytes-oriented approach will surely be more efficient. But at that point, using a C++/Rust/etc. library would surely be order of magnitude faster still. (Though, I don't know what libraries are available. I've thought about porting this library to Rust as a learning excercise, but I really don't have any use for it, so I haven't ^^;)

>>> print("ö".encode("latin1").decode("utf8", errors="surrogateescape"))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcf6' in position 0: surrogates not allowed

Thanks for pointing this error, that's a bit nasty. I should note this in the documentation, if the next release is going to have surrogate escapes as default.

milahu added 3 commits March 10, 2024 14:31

use bytestrings

943f464

srt: add option keep_newlines

d030688

srt: add option keep_original_newlines

b0d731b

milahu force-pushed the bytestrings branch 2 times, most recently from b9ca41e to 319ad18 Compare March 10, 2024 18:03

microdvd: add option keep_style_tags

a270af6

milahu force-pushed the bytestrings branch from 107fbde to a270af6 Compare March 10, 2024 20:35

tkarabela self-requested a review May 5, 2024 13:54

tkarabela closed this May 5, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

use bytestrings #84

use bytestrings #84

milahu commented Mar 10, 2024 •

edited

tkarabela commented May 5, 2024

milahu commented May 5, 2024

tkarabela commented May 8, 2024

use bytestrings #84

use bytestrings #84

Conversation

milahu commented Mar 10, 2024 • edited

tkarabela commented May 5, 2024

milahu commented May 5, 2024

tkarabela commented May 8, 2024

milahu commented Mar 10, 2024 •

edited