Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

More Python 3 fixes #457

Merged
merged 24 commits into from Oct 3, 2014
Merged

More Python 3 fixes #457

merged 24 commits into from Oct 3, 2014

Conversation

astrofrog
Copy link
Member

@astrofrog
Copy link
Member Author

@ChrisBeaumont - just for info, for the categorical data, you currently convert to an object array, which np.unique can't deal with in Python 3 (numpy/numpy#5145 and numpy/numpy#641). I'll try and figure out a workaround but if you also have any ideas, let me know. It would be easy if we just had np.unique(values) since once can use list(set()) to do the filtering, but you also use return_inverse so we need to figure out how to emulate that.

@astrofrog
Copy link
Member Author

Just to check, is there a reason that we need to allow mixed type arrays for categorical data? What is the use case for this? (as opposed to just strings)

@astrofrog
Copy link
Member Author

@ChrisBeaumont - another question - in the image loader you have:

def img_loader(file_name):
    """Load an image to a numpy array, using either PIL or skimage

    :param file_name: Path of file to load
    :rtype: Numpy array
    """
    try:
        from skimage.io import imread
        return np.asarray(imread(file_name))
    except ImportError:
        pass

    try:
        from PIL import Image
        return np.asarray(Image.open(file_name))
    except ImportError:
        raise ImportError("Reading %s requires PIL or scikit-image" %
                          file_name)

scikit-image is just interfacing to matplotlib or PIL. If it uses matplotlib, then it actually returns the wrong result (it returns a floating point array normalized to 1). Can we simply remove the first try...except and use just PIL/pillow? (since scikit-image is just using PIL)

@ChrisBeaumont
Copy link
Member

Requiring string arrays for categorical data might be fine -- @JudoWill developed this feature, so he should confirm. But I think that should be general enough.

@astrofrog
Copy link
Member Author

@ChrisBeaumont - ok, and we can also cast to string in case there are any non-string values in the data, to be a little more flexible.

Regarding the scikit-image, the relevant issue where I found out about the different plugins is here: scikit-image/scikit-image#1187

Basically I don't think there is any benefit to using scikit-image because it doesn't actually implement it, it's just a wrapper around matplotlib and PIL.

@astrofrog
Copy link
Member Author

For scikit-image I can also simply check the output type and if the output is float then transform it to the uint representation for images (as described here: http://scikit-image.org/docs/0.10.x/user_guide/data_types.html)

@ChrisBeaumont
Copy link
Member

Regarding the skimage try/except block: I suspect I thought skimage was able to parse more formats than PIL can. In principle this is true, since it has a plugin system akin to astropy. But I see the only default plugins are PIL and matplotlib, so skimage isn't adding anything in the normal case. No strong opinions from me

@astrofrog
Copy link
Member Author

Ok, I'll see what I can do tomorrow. I might leave it but add a call to img_as_ubyte to force it to uint8.

@astrofrog
Copy link
Member Author

@JudoWill - I've updated the CategoricalData class to convert categories to strings but recognize nan and None as missing values, and converting the categories to '' for these. Can you review this change and let me know if it sounds ok to you?

@astrofrog
Copy link
Member Author

@ChrisBeaumont - I have all the tests passing in Python 3 except those in glue/qt so in the spirit of moving forward and avoiding regressions, I've updated Travis to test everything except glue/qt on Python 3. This means that at least we won't introduce Python 3 regressions in most places. I will address glue/qt in a separate PR. The things to check before merging are:

  • that the treatment of categorical data is ok
  • the question regarding the inequality in a9ffe34
  • the change to the plotly exporter and test - basically this was relying on dictionary order which is much more random in Python 3 so it uncovered the error (I think it was just lucky that it never failed in Python 2).

Other than that, this is good to go.

@astrofrog astrofrog changed the title WIP: More Python 3 fixes More Python 3 fixes Oct 3, 2014
@JudoWill
Copy link
Contributor

JudoWill commented Oct 3, 2014

@astrofrog Originally I implemented categorical variables as 'any' object (in some cases one might want to treat an integer as a category) but as I use it more and more, I find that I only ever use strings. If its a blocker for Py3 compatibility I'm more then willing to give up the "generic" nature of categories.

Can you put a type-check in the __init__ (or really anywhere) to make sure someone only ever passes strings, I much prefer this over silent conversion. Maybe raising a NotImplementedError or a simple AssertionError would be sufficient.

@astrofrog
Copy link
Member Author

@JudoWill - thanks for the quick reply! I'll implement this. In future, the Numpy bug should be fixed of course, so I'll add a TODO to remind us to change it back if we ever need. Having said that, to plot categorical data we need to convert to strings anyway, so maybe requiring strings is not much of an issue.

@@ -131,7 +138,7 @@ def __init__(self, dc, parent):

@memoize
def child(self, row):
if row < self.dc.subset_groups:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bug in the original code -- the test should be if row < len(self.dc.subset_groups) (integer to integer comparison)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, that makes more sense!

@astrofrog
Copy link
Member Author

@JudoWill @ChrisBeaumont - actually we don't have to be so strict, we just can't allow mixed types. But we can have integer or boolean categorical data. I've updated the code to do the following:

  • If the input is an array, check that it is not of type object
  • If the input is another iterable, check that all objects are the same type

@ChrisBeaumont - all that's left now is to check my question about the inequality and the plotly change

@astrofrog
Copy link
Member Author

@ChrisBeaumont - I've addressed your comments. As mentioned above, pyside doesn't install with Python 3.4 so I'm installing pyside on Python 2, and pyqt on Python 3 - does that work for you?

# Check that categorical data is of uniform type
if isinstance(categorical_data, np.ndarray):
if categorical_data.dtype.kind == "O":
raise TypeError("Numpy object type not supported")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if I agree with @JudoWill about raising an error here. If a user passes a pandas series of strings into glue, it will be of dtype=object. Raising an error for that case is annoying.

Actually, if the only issue with mixed types is the call to numpy.unique, it looks like pandas.factorize handles mixed types in python3:

In [19]: x = np.array([1, 1, 3, 'hi', 'hi', 3], dtype=object)

In [20]: pd.factorize(x)
Out[20]: (array([0, 0, 1, 2, 2, 1]), array([1, 3, 'hi'], dtype=object))

In [21]: np.unique(x)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-21-9a478241c930> in <module>()
----> 1 np.unique(x)

/Users/beaumont/anaconda/envs/astropy3/lib/python3.3/site-packages/numpy/lib/arraysetops.py in unique(ar, return_index, return_inverse, return_counts)
    193         aux = ar[perm]
    194     else:
--> 195         ar.sort()
    196         aux = ar
    197     flag = np.concatenate(([True], aux[1:] != aux[:-1]))

TypeError: unorderable types: str() > int()

Since pandas is a required dependency of Glue anyways, I know think it might be best to change the _update_categories method, and keep mixed types unless there's another roadblock.

I can make this change if you want (we probably need to use pandas instead of np.searchsorted in _update_data as well)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ChrisBeaumont - I'm not too familiar with Pandas so feel free to make the change. If possible I think it would probably make sense to have a compatibility 'unique' function in the utilities that can be called from here, to limit the number of places where we are dependent on pandas (and in future we can switch the compatibility layer back to using Numpy once they have fixed it).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ChrisBeaumont - Actually, are you sure it doesn't already work as you would like? If you pass a pandas object it will currently not validate as a numpy array so it will iterate over them to check that they are all strings, then make a string array out of them. In fact, I think there are already some tests that use pandas.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might, I didn't check. But I opened a PR on your copy that should deal with the problems of mixed types more directly.

Chris Beaumont and others added 2 commits October 3, 2014 11:02
@astrofrog
Copy link
Member Author

@ChrisBeaumont - I didn't have travis set up on my fork so didn't see there were failures. I should be able to address them.

@ChrisBeaumont
Copy link
Member

Ah, sorry about that

@astrofrog
Copy link
Member Author

@ChrisBeaumont - the build is passing now! Feel free to merge :) I'll then rebase #458.

@ChrisBeaumont
Copy link
Member

Great, thanks!

ChrisBeaumont added a commit that referenced this pull request Oct 3, 2014
@ChrisBeaumont ChrisBeaumont merged commit 062fcb0 into glue-viz:master Oct 3, 2014
astrofrog added a commit that referenced this pull request Oct 4, 2014
Final round of Python 3 fixes (requires #457 to be merged first)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants