models.text Behaviour (encoding choice) #1683

glebourgeois · 2013-10-17T16:07:01Z

I have a question about the models module behaviour, around the text method.
When no encoding is given, there is an encoding detection, which is a good thing :

if self.encoding is None:  
    encoding = self.apparent_encoding

When an encoding is given, first try use it for decoding :

try:  
    content = str(self.content, encoding, errors='replace')  
except (LookupError, TypeError):  
    content = str(self.content, errors='replace')

What I don't understand is, when an exception occurs, why it is not tried to use the self.apparent_encoding method, to do the

content = str(self.content, errors='replace')

in a more clever way.

It then would be something like that :

try:  
    content = str(self.content, encoding, errors='replace')  
except (LookupError, TypeError):
    encoding = self.apparent_encoding
    content = str(self.content, encoding, errors='replace')

Wouldn't it be better to try to detect the encoding instead of using locale one ?

The text was updated successfully, but these errors were encountered:

sigmavirus24 · 2013-10-17T16:27:06Z

With the caveat that I hope you understand this would not solve your issue because we aren't catching that SystemError, I actually think this wouldn't be a bad idea if charade weren't painfully slow. If you'd like to make charade faster, then by all means we could do this, but I think the primary concern would be one of speed.

charade is pretty accurate but that comes at the cost of being slow. There are alternative which are very fast but not as accurate. And of course, character encoding detection is a very inexact science in the first place. This is evidenced by the fact that Chrome/Chromium detected your content as ISO-8859-1 and charade as ISO-8859-2. There's no guarantee that there wouldn't be cases where that would raise an issue.

Furthermore, assuming charade gets it wrong, then it's entirely possible that we'll still see a really awful exception because that call in the except block would not catch any errors thrown by the second call. Yes we don't handle that now but that's because we should raise an exception in the event we can't decode using that either.

What catching those two would look like is:

try:
    content = str(self.content, encoding, errors='replace')
except (LookupError, TypeError):
    pass
else:
    try:
        content = str(self.content, self.apparent_encoding, errors='replace')
    except (LookupError, TypeError):
        content = str(self.content, errors='replace')

I'm not quite fond of that. And I'm not sure if it could be extracted to a function/method that would work the way we want easily.

sigmavirus24 · 2014-02-17T14:41:51Z

Closing to focus discussion on #1737

itsadok mentioned this issue Nov 14, 2013

[Suggestion] Simplify charset handling #1737

Open

sigmavirus24 closed this as completed Feb 17, 2014

github-actions bot locked as resolved and limited conversation to collaborators Sep 9, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

models.text Behaviour (encoding choice) #1683

models.text Behaviour (encoding choice) #1683

glebourgeois commented Oct 17, 2013

sigmavirus24 commented Oct 17, 2013

sigmavirus24 commented Feb 17, 2014

models.text Behaviour (encoding choice) #1683

models.text Behaviour (encoding choice) #1683

Comments

glebourgeois commented Oct 17, 2013

sigmavirus24 commented Oct 17, 2013

sigmavirus24 commented Feb 17, 2014