UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 5: invalid continuation byte #577
I found a solution to fetch the options:

states = set()
try:
    for key in kid.AP.N:
        states.update([key])
except UnicodeDecodeError as e:
    # e.object holds the raw bytes that could not be decoded as UTF-8
    states.update([e.object.decode('iso-8859-1')])

But I have another question: I would like to set the radio button to a value containing accented characters. I tried this:

values = {}
try:
    for key in kid.AP.N:
        values[key] = key
except UnicodeDecodeError as e:
    values[e.object.decode('iso-8859-1')] = e.object
if str(value) in values.keys():
    value = Name(str(values[str(value)], encoding='pdfdoc', errors='replace'))
    kid.AS = value
elif '/AS' in kid:
    kid.AS = Name.Off

but it does not work. I also tried to use the "original" value, Name(values[str(value)]); that does not work either, because it is bytes. Name(str(values[str(value)], encoding='pdfdoc')) also does not work because, as far as Python is concerned, it does not start with /. Do you have any ideas on how to do that?
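The fallback used in the first snippet can be reproduced without a PDF at hand. This is a minimal sketch, assuming the name was written to the file as raw ISO-8859-1 bytes (an assumption; the file itself is not shown in this thread). `decode_name_bytes` is a hypothetical helper, not part of pikepdf.

```python
def decode_name_bytes(raw: bytes) -> str:
    """Decode name bytes as UTF-8, falling back to ISO-8859-1.

    Hypothetical helper for illustration; not a pikepdf API.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError as e:
        # e.object is the original byte string that failed to decode.
        return e.object.decode("iso-8859-1")


# The same name encoded two ways (assumed inputs for the demonstration).
latin1_name = "/Célibataire".encode("iso-8859-1")  # é is the single byte 0xE9
utf8_name = "/Célibataire".encode("utf-8")         # é is the two bytes 0xC3 0xA9

print(decode_name_bytes(latin1_name))
print(decode_name_bytes(utf8_name))
```

Note that ISO-8859-1 maps every byte to some character, so the fallback never raises. It can still silently return the wrong text if the bytes were actually in another encoding.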
When I comment out the part that refuses bytes, like this https://github.com/pikepdf/pikepdf/blob/main/src/pikepdf/objects.py#L101, I can successfully set the radio button value using the "original" value:
Is there a reason why bytes are not allowed? If there is, what is the supported method for updating a field with a value containing accented characters?
The input PDF is malformed in a way that pikepdf cannot correct for.
Analysis
In PDF, Dictionary objects are key-value maps like Python's dict, except that the key is restricted: it must be a PDF Name object. A Name object is denoted by a leading /, and what follows must be encoded in a specific way. A Name cannot store arbitrary bytes; specifically, it cannot store the null character. The process of encoding a Name is:
So the expected encoding of Célibataire, as seen in a hex editor, should be:

>>> import re
>>> re.sub(
...     br'[^\x21-\x7e]',
...     lambda m: (b'#' + hex(ord(m.group(0))).upper()[2:].encode()),
...     'Célibataire'.encode('utf-8'),
... )  # this regex does not handle all cases of encoding, I just developed it in exploring the issue
b'C#C3#A9libataire'

The error message suggests your file has one of these two encoding errors.
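The regex above can be extended into a small helper. The following is a sketch only, based on my reading of the name syntax in the PDF specification. `encode_pdf_name` is a hypothetical function, not something pikepdf provides. It also escapes `#` and the PDF delimiter characters, and it rejects the null character. The ISO-8859-1 call is there purely for contrast: byte 0xE9 in the error title is how é is encoded in ISO-8859-1, but which encoding error the file actually has is not established here.

```python
import re

# Bytes that may not be written literally in a name: anything outside
# printable ASCII 0x21-0x7E, plus '#' and the PDF delimiter characters.
_NEEDS_ESCAPE = re.compile(rb"[^\x21-\x7e]|[#/()<>\[\]{}%]")


def encode_pdf_name(text: str, encoding: str = "utf-8") -> bytes:
    """Return the on-disk form of a PDF name (hypothetical helper)."""
    raw = text.encode(encoding)
    if b"\x00" in raw:
        raise ValueError("a PDF name cannot contain the null character")
    # Replace each offending byte with '#' followed by two uppercase hex digits.
    escaped = _NEEDS_ESCAPE.sub(lambda m: b"#%02X" % m.group(0)[0], raw)
    return b"/" + escaped


print(encode_pdf_name("Célibataire"))                # b'/C#C3#A9libataire'
print(encode_pdf_name("Célibataire", "iso-8859-1"))  # b'/C#E9libataire'
```

Using `%02X` rather than `hex(...)[2:]` keeps the escape at two digits for byte values below 0x10, which is one of the cases the original regex does not handle.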
Big picture
I can see this is frustrating, but it looks like the input is a malformed file that needs forensic repair.
Design note
For better or for worse, pikepdf automatically converts str to pikepdf.Name when you interact with pikepdf.Dictionary, which means that for malformed input files you get difficult exceptions like the one in this issue. In retrospect this was probably a design mistake on my part; in a way it is analogous to the automatic str/bytes conversion in Python 2. Initially I wanted to make porting pypdf2 code to pikepdf as easy as possible, and pypdf2 does the same. At some point I will probably deprecate this and force pikepdf.Dictionary to function as strictly a
Calling annotation.AP.N.keys() on radio buttons with options containing accented characters such as é, è, ê, etc. throws the following error:
Should the options in src/core/object.cpp:610:
be encoded before returning them, or is there a way to encode them at the same time as calling the function in Python so I can get all the available options?
As an example, on a radio button I have the following options:
Currently I can do the following:
And I get these logs: