Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Python 2/3 string handling #1731

Merged
merged 9 commits into from Feb 22, 2016

Conversation

mvdbeek
Copy link
Member

@mvdbeek mvdbeek commented Feb 15, 2016

To make the string handling python 2 and python 3 compatible, I

  • removed all instances of unicode and replaced it with six's text_type (which is unicode on python2, str on python3).
  • replaced basestring with six's string_types
  • replaced isinstance(var, str) with isinstance(var, string_types) or isinstance(var, string_types) and not isinstance(var, text_type), depending on the context.

@@ -163,9 +163,6 @@ def value_to_basic( self, value, app ):
return self.to_string( value, app )

def value_from_basic( self, value, app, ignore_errors=False ):
# HACK: Some things don't deal with unicode well, psycopg problem?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mvdbeek Can you comment on this? Can you be confident this hack is no longer needed?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mvdbeek Can you comment on this? Can you be confident this hack is no longer needed?

I am (working on) testing this now on a copy of our production server (which is running postgresql-9.4).
Hard to be sure what happens on old postgresql versions though.
The alternative would be

In [12]: u"tést".encode('ascii', 'replace')
Out[12]: 't?st'

or

In [4]: print(u"tést".encode('UTF-8'))
tést

?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mvdbeek Can you comment on this? Can you be confident this hack is no longer needed?

Doing a few standard things (running workflows, deleting datasets) it seems to work fine with postgresql-9.3 and postgresql-9.4.
pscycopg explicitly supports unicode as well, so ... I'm not sure what to do. I guess there is no chance to see what exactly was the reason for the hack :/

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was introduced in commit 7bfc238, but that hardly helps.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is old enough that we can assume library versions are fairly different. I'd say just drop the hack for now.

@jmchilton
Copy link
Member

👍 Very nice.

xref #1715

out.append( '<tr><td>%s</td></tr>' % escape( line ) )
else:
out.append( '<tr><td>%s</td></tr>' % escape( unicode( line, 'utf-8' ) ) )
out.append( '<tr><td>%s</td></tr>' % escape( text_type( line, 'utf-8' ) ) )
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does not work in Python 3 because text_type() is just replaced with str():

TypeError: decoding str is not supported

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, yes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually it works when line is binary:

>>> text_type(b'ciao \xc3\xa8', 'utf-8')
'ciao è'

So probably it's correct.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we would make the least amount of assumptions by using text_type(var.encode('utf-8')).
This works in python 2 and 3.

@nsoranzo
Copy link
Member

Since unicodify() now uses six, it may make sense to add the following change to this PR:

diff --git a/lib/pulsar/client/interface.py b/lib/pulsar/client/interface.py
index 9d625f7..9b3e2a0 100644
--- a/lib/pulsar/client/interface.py
+++ b/lib/pulsar/client/interface.py
@@ -3,10 +3,7 @@ from abc import abstractmethod
 from string import Template

 from six import BytesIO
-try:
-    from six import text_type
-except ImportError:
-    from galaxy.util import unicodify as text_type
+from six import text_type
 try:
     from urllib import urlencode
 except ImportError:

return text_type( value, encoding, error )
except Exception as e:
log.debug("value %s could not be coerced to unicode" % value)
log.debug(e)
return default
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This implementation returns default if value is e.g. a list, which is wrong. I think the following implementation is more correct:

    if value is None:
        return None
    try:
        if not isinstance(value, string_types):
            value = str(value)
        # At this point value is of type str, which in Python 2 needs to be converted to unicode
        if not isinstance(value, text_type):
            value = text_type(value, encoding, error)
    except Exception:
        log.exception("value %s could not be coerced to unicode" % value)
        return default
    return value

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right, that makes sense. thanks!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've updated the above code to have also the call to text_type() inside the try.

@mvdbeek mvdbeek changed the title Python 2/3 string handling WIP: Python 2/3 string handling Feb 16, 2016
@mvdbeek
Copy link
Member Author

mvdbeek commented Feb 16, 2016

@nsoranzo thanks a lot for the suggestions and the thorough review. I hope I didn't miss anything.
In addition I switched to using text_type(var.encode('utf-8')) instead of text_type(var, 'utf-8') when var is supposed to be a string, let me know if that's a bad idea.

@mvdbeek mvdbeek changed the title WIP: Python 2/3 string handling Python 2/3 string handling Feb 16, 2016
@nsoranzo
Copy link
Member

@mvdbeek Thanks for updating the PR! I think text_type(var.encode('utf-8')) is not correct in Python 2:

>>> from __future__ import print_function
>>> from six import text_type
>>> line = 'ciao è'
>>> uline = unicode(line, 'utf-8')
>>> print("uline is '%s' of type %s" % (uline, type(uline)))
uline is 'ciao è' of type <type 'unicode'>
>>> uline = text_type(line.encode('utf-8'))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 5: ordinal not in range(128)

If we slightly modify unicodify() as in https://gist.github.com/nsoranzo/6ab474f2ae74c04b0c0c, then we can use this function instead:

>>> uline = unicodify(line)
>>> print("uline is '%s' of type %s" % (uline, type(uline)))
uline is 'ciao è' of type <type 'unicode'>

return unicode( a_string, 'utf-8' )
elif a_string_type is unicode:
return a_string
return unicodify( a_string )
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should specify the 'utf-8' encoding:

    return unicodify( a_string, 'utf-8' )

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch, thanks.

else:
return val
return unicodify(val, "utf8" )
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original code was changing unicode to binary string, now it's doing the opposite! I think this should be:

    elif isinstance( val, text_type ):
        return val.encode( "utf8" )
    else:
        return val

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right, I was thinking it might be advantageous to use unicode here and here
It did prevent issue #1702 for me (in a different context, of course), ...
Let me know what you think about this, I can change it back to the original behaviour.
Also, I guess we can remove json_fix from here and import if from galaxy.util.json, right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to remove json_fix from here and import if from galaxy.util.json .

I think the fix for #1702 should be different, I'll comment there.

@mvdbeek mvdbeek force-pushed the python_2_3_string_handling branch 2 times, most recently from b76bd5d to f8caa3e Compare February 18, 2016 10:35
if not isinstance( filter, basestring ):
filter = unicode( filter ).encode("utf-8")
if not isinstance( filter, string_types ):
text_type( filter ).encode("utf-8")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

                                filter = text_type( filter ).encode("utf-8")

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nsoranzo
ahh, of course.
but ... what is this code actually doing?
I don't see filter being re-used anywhere ?!
Should this be:

column_filter_encoded = [ ]
for filter in column_filter:
    if not isinstance( filter, string_types ):
        filter = text_type( filter ).encode("utf-8")
    column_filter_encoded.append( filter )
extra_url_args[ "f-" + column.key ] = dumps( column_filter_encoded )

?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mvdbeek You're right! But I would rewrite it as:

                        column_filter = [text_type(_).encode('utf-8') for _ in column_filter if not isinstance(_, string_types)]
                        extra_url_args[ "f-" + column.key ] = dumps( column_filter )

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

list comprehension was my first thought too, but in your version we're dropping filter if it is of string_type.
how about this?
column_filter = [text_type(_).encode('utf-8') if not isinstance(_, string_types) else _ for _ in column_filter]

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ops, you're right!

@nsoranzo
Copy link
Member

@mvdbeek 👍 but the branch has conflicts, can you resolve them?

@mvdbeek
Copy link
Member Author

mvdbeek commented Feb 20, 2016

@nsoranzo should be OK now.

nsoranzo added a commit that referenced this pull request Feb 22, 2016
@nsoranzo nsoranzo merged commit 1b25944 into galaxyproject:dev Feb 22, 2016
@nsoranzo
Copy link
Member

Thanks @mvdbeek!

@mvdbeek
Copy link
Member Author

mvdbeek commented Feb 22, 2016

Sure, and thanks for your help @nsoranzo !

@mvdbeek mvdbeek deleted the python_2_3_string_handling branch April 8, 2016 14:57
@dannon dannon mentioned this pull request Apr 19, 2016
@nsoranzo nsoranzo added the area/python3 Specific to Python 3 label Jul 7, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants