Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

models.text Behaviour (encoding choice) #1683

Closed
glebourgeois opened this issue Oct 17, 2013 · 2 comments
Closed

models.text Behaviour (encoding choice) #1683

glebourgeois opened this issue Oct 17, 2013 · 2 comments

Comments

@glebourgeois
Copy link

I have a question about the models module behaviour, around the text method.
When no encoding is given, there is an encoding detection, which is a good thing :

if self.encoding is None:  
    encoding = self.apparent_encoding

When an encoding is given, first try use it for decoding :

try:  
    content = str(self.content, encoding, errors='replace')  
except (LookupError, TypeError):  
    content = str(self.content, errors='replace')

What I don't understand is, when an exception occurs, why it is not tried to use the self.apparent_encoding method, to do the

content = str(self.content, errors='replace') 

in a more clever way.

It then would be something like that :

try:  
    content = str(self.content, encoding, errors='replace')  
except (LookupError, TypeError):
    encoding = self.apparent_encoding
    content = str(self.content, encoding, errors='replace')

Wouldn't it be better to try to detect the encoding instead of using locale one ?

@sigmavirus24
Copy link
Contributor

With the caveat that I hope you understand this would not solve your issue because we aren't catching that SystemError, I actually think this wouldn't be a bad idea if charade weren't painfully slow. If you'd like to make charade faster, then by all means we could do this, but I think the primary concern would be one of speed.

charade is pretty accurate but that comes at the cost of being slow. There are alternative which are very fast but not as accurate. And of course, character encoding detection is a very inexact science in the first place. This is evidenced by the fact that Chrome/Chromium detected your content as ISO-8859-1 and charade as ISO-8859-2. There's no guarantee that there wouldn't be cases where that would raise an issue.

Furthermore, assuming charade gets it wrong, then it's entirely possible that we'll still see a really awful exception because that call in the except block would not catch any errors thrown by the second call. Yes we don't handle that now but that's because we should raise an exception in the event we can't decode using that either.

What catching those two would look like is:

try:
    content = str(self.content, encoding, errors='replace')
except (LookupError, TypeError):
    pass
else:
    try:
        content = str(self.content, self.apparent_encoding, errors='replace')
    except (LookupError, TypeError):
        content = str(self.content, errors='replace')

I'm not quite fond of that. And I'm not sure if it could be extracted to a function/method that would work the way we want easily.

@sigmavirus24
Copy link
Contributor

Closing to focus discussion on #1737

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 9, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants