New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
More Python 3 fixes #457
More Python 3 fixes #457
Conversation
@ChrisBeaumont - just for info, for the categorical data, you currently convert to an object array, which |
Just to check, is there a reason that we need to allow mixed type arrays for categorical data? What is the use case for this? (as opposed to just strings) |
@ChrisBeaumont - another question - in the image loader you have:
scikit-image is just interfacing to matplotlib or PIL. If it uses matplotlib, then it actually returns the wrong result (it returns a floating point array normalized to 1). Can we simply remove the first try...except and use just PIL/pillow? (since scikit-image is just using PIL) |
Requiring string arrays for categorical data might be fine -- @JudoWill developed this feature, so he should confirm. But I think that should be general enough. |
@ChrisBeaumont - ok, and we can also cast to string in case there are any non-string values in the data, to be a little more flexible. Regarding the scikit-image, the relevant issue where I found out about the different plugins is here: scikit-image/scikit-image#1187 Basically I don't think there is any benefit to using scikit-image because it doesn't actually implement it, it's just a wrapper around matplotlib and PIL. |
For scikit-image I can also simply check the output type and if the output is float then transform it to the uint representation for images (as described here: http://scikit-image.org/docs/0.10.x/user_guide/data_types.html) |
Regarding the skimage try/except block: I suspect I thought skimage was able to parse more formats than PIL can. In principle this is true, since it has a plugin system akin to astropy. But I see the only default plugins are PIL and matplotlib, so skimage isn't adding anything in the normal case. No strong opinions from me |
Ok, I'll see what I can do tomorrow. I might leave it but add a call to |
…t with mixed float and string types
…to indicate no category
@JudoWill - I've updated the CategoricalData class to convert categories to strings but recognize nan and None as missing values, and converting the categories to |
@ChrisBeaumont - I have all the tests passing in Python 3 except those in
Other than that, this is good to go. |
@astrofrog Originally I implemented categorical variables as 'any' object (in some cases one might want to treat an integer as a category) but as I use it more and more, I find that I only ever use strings. If its a blocker for Py3 compatibility I'm more then willing to give up the "generic" nature of categories. Can you put a type-check in the |
@JudoWill - thanks for the quick reply! I'll implement this. In future, the Numpy bug should be fixed of course, so I'll add a TODO to remind us to change it back if we ever need. Having said that, to plot categorical data we need to convert to strings anyway, so maybe requiring strings is not much of an issue. |
@@ -131,7 +138,7 @@ def __init__(self, dc, parent): | |||
|
|||
@memoize | |||
def child(self, row): | |||
if row < self.dc.subset_groups: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a bug in the original code -- the test should be if row < len(self.dc.subset_groups)
(integer to integer comparison)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, that makes more sense!
@JudoWill @ChrisBeaumont - actually we don't have to be so strict, we just can't allow mixed types. But we can have integer or boolean categorical data. I've updated the code to do the following:
@ChrisBeaumont - all that's left now is to check my question about the inequality and the plotly change |
@ChrisBeaumont - I've addressed your comments. As mentioned above, pyside doesn't install with Python 3.4 so I'm installing pyside on Python 2, and pyqt on Python 3 - does that work for you? |
# Check that categorical data is of uniform type | ||
if isinstance(categorical_data, np.ndarray): | ||
if categorical_data.dtype.kind == "O": | ||
raise TypeError("Numpy object type not supported") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if I agree with @JudoWill about raising an error here. If a user passes a pandas series of strings into glue, it will be of dtype=object. Raising an error for that case is annoying.
Actually, if the only issue with mixed types is the call to numpy.unique, it looks like pandas.factorize handles mixed types in python3:
In [19]: x = np.array([1, 1, 3, 'hi', 'hi', 3], dtype=object)
In [20]: pd.factorize(x)
Out[20]: (array([0, 0, 1, 2, 2, 1]), array([1, 3, 'hi'], dtype=object))
In [21]: np.unique(x)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-21-9a478241c930> in <module>()
----> 1 np.unique(x)
/Users/beaumont/anaconda/envs/astropy3/lib/python3.3/site-packages/numpy/lib/arraysetops.py in unique(ar, return_index, return_inverse, return_counts)
193 aux = ar[perm]
194 else:
--> 195 ar.sort()
196 aux = ar
197 flag = np.concatenate(([True], aux[1:] != aux[:-1]))
TypeError: unorderable types: str() > int()
Since pandas is a required dependency of Glue anyways, I know think it might be best to change the _update_categories
method, and keep mixed types unless there's another roadblock.
I can make this change if you want (we probably need to use pandas instead of np.searchsorted in _update_data
as well)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ChrisBeaumont - I'm not too familiar with Pandas so feel free to make the change. If possible I think it would probably make sense to have a compatibility 'unique' function in the utilities that can be called from here, to limit the number of places where we are dependent on pandas (and in future we can switch the compatibility layer back to using Numpy once they have fixed it).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ChrisBeaumont - Actually, are you sure it doesn't already work as you would like? If you pass a pandas object it will currently not validate as a numpy array so it will iterate over them to check that they are all strings, then make a string array out of them. In fact, I think there are already some tests that use pandas.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might, I didn't check. But I opened a PR on your copy that should deal with the problems of mixed types more directly.
Use pandas to deal with categorical lookups on mixed-type arrays
@ChrisBeaumont - I didn't have travis set up on my fork so didn't see there were failures. I should be able to address them. |
0e55311
to
959b212
Compare
Ah, sorry about that |
@ChrisBeaumont - the build is passing now! Feel free to merge :) I'll then rebase #458. |
Great, thanks! |
Final round of Python 3 fixes (requires #457 to be merged first)
This is a work in progress, and I am aiming to try and get the test suite to pass in Python 3.
Related issues:
imread returns different results on Python 2 and Python 3 scikit-image/scikit-image#1187(jeez, this is turning out to be quite the bug hunt)