-
Notifications
You must be signed in to change notification settings - Fork 965
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Python 2/3 string handling #1731
Python 2/3 string handling #1731
Conversation
@@ -163,9 +163,6 @@ def value_to_basic( self, value, app ): | |||
return self.to_string( value, app ) | |||
|
|||
def value_from_basic( self, value, app, ignore_errors=False ): | |||
# HACK: Some things don't deal with unicode well, psycopg problem? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mvdbeek Can you comment on this? Can you be confident this hack is no longer needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mvdbeek Can you comment on this? Can you be confident this hack is no longer needed?
I am (working on) testing this now on a copy of our production server (which is running postgresql-9.4).
Hard to be sure what happens on old postgresql versions though.
The alternative would be
In [12]: u"tést".encode('ascii', 'replace')
Out[12]: 't?st'
or
In [4]: print(u"tést".encode('UTF-8'))
tést
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mvdbeek Can you comment on this? Can you be confident this hack is no longer needed?
Doing a few standard things (running workflows, deleting datasets) it seems to work fine with postgresql-9.3 and postgresql-9.4.
pscycopg explicitly supports unicode as well, so ... I'm not sure what to do. I guess there is no chance to see what exactly was the reason for the hack :/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was introduced in commit 7bfc238, but that hardly helps.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is old enough that we can assume library versions are fairly different. I'd say just drop the hack for now.
👍 Very nice. xref #1715 |
out.append( '<tr><td>%s</td></tr>' % escape( line ) ) | ||
else: | ||
out.append( '<tr><td>%s</td></tr>' % escape( unicode( line, 'utf-8' ) ) ) | ||
out.append( '<tr><td>%s</td></tr>' % escape( text_type( line, 'utf-8' ) ) ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This does not work in Python 3 because text_type()
is just replaced with str()
:
TypeError: decoding str is not supported
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, yes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually it works when line
is binary:
>>> text_type(b'ciao \xc3\xa8', 'utf-8')
'ciao è'
So probably it's correct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess we would make the least amount of assumptions by using text_type(var.encode('utf-8'))
.
This works in python 2 and 3.
Since
|
return text_type( value, encoding, error ) | ||
except Exception as e: | ||
log.debug("value %s could not be coerced to unicode" % value) | ||
log.debug(e) | ||
return default |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This implementation returns default
if value
is e.g. a list, which is wrong. I think the following implementation is more correct:
if value is None:
return None
try:
if not isinstance(value, string_types):
value = str(value)
# At this point value is of type str, which in Python 2 needs to be converted to unicode
if not isinstance(value, text_type):
value = text_type(value, encoding, error)
except Exception:
log.exception("value %s could not be coerced to unicode" % value)
return default
return value
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
right, that makes sense. thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've updated the above code to have also the call to text_type() inside the try.
@nsoranzo thanks a lot for the suggestions and the thorough review. I hope I didn't miss anything. |
@mvdbeek Thanks for updating the PR! I think
If we slightly modify
|
ea649ce
to
b593c1e
Compare
return unicode( a_string, 'utf-8' ) | ||
elif a_string_type is unicode: | ||
return a_string | ||
return unicodify( a_string ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should specify the 'utf-8' encoding:
return unicodify( a_string, 'utf-8' )
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good catch, thanks.
else: | ||
return val | ||
return unicodify(val, "utf8" ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The original code was changing unicode to binary string, now it's doing the opposite! I think this should be:
elif isinstance( val, text_type ):
return val.encode( "utf8" )
else:
return val
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
right, I was thinking it might be advantageous to use unicode here and here
It did prevent issue #1702 for me (in a different context, of course), ...
Let me know what you think about this, I can change it back to the original behaviour.
Also, I guess we can remove json_fix from here and import if from galaxy.util.json, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 to remove json_fix
from here and import if from galaxy.util.json
.
I think the fix for #1702 should be different, I'll comment there.
b76bd5d
to
f8caa3e
Compare
if not isinstance( filter, basestring ): | ||
filter = unicode( filter ).encode("utf-8") | ||
if not isinstance( filter, string_types ): | ||
text_type( filter ).encode("utf-8") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
filter = text_type( filter ).encode("utf-8")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nsoranzo
ahh, of course.
but ... what is this code actually doing?
I don't see filter
being re-used anywhere ?!
Should this be:
column_filter_encoded = [ ]
for filter in column_filter:
if not isinstance( filter, string_types ):
filter = text_type( filter ).encode("utf-8")
column_filter_encoded.append( filter )
extra_url_args[ "f-" + column.key ] = dumps( column_filter_encoded )
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mvdbeek You're right! But I would rewrite it as:
column_filter = [text_type(_).encode('utf-8') for _ in column_filter if not isinstance(_, string_types)]
extra_url_args[ "f-" + column.key ] = dumps( column_filter )
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
list comprehension was my first thought too, but in your version we're dropping filter
if it is of string_type
.
how about this?
column_filter = [text_type(_).encode('utf-8') if not isinstance(_, string_types) else _ for _ in column_filter]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ops, you're right!
adce0b4
to
b55ee9a
Compare
@mvdbeek 👍 but the branch has conflicts, can you resolve them? |
b55ee9a
to
92fe009
Compare
@nsoranzo should be OK now. |
…es, unicode with six's text_type.
Thanks @mvdbeek! |
Sure, and thanks for your help @nsoranzo ! |
To make the string handling python 2 and python 3 compatible, I
unicode
and replaced it with six'stext_type
(which isunicode
on python2,str
on python3).basestring
with six'sstring_types
isinstance(var, str)
withisinstance(var, string_types)
orisinstance(var, string_types) and not isinstance(var, text_type)
, depending on the context.