Unique value classifier for categorical maps with distinct colors for large number of categories. #173

Open
wants to merge 8 commits into
base: main
from

sjsrey
Member

@sjsrey sjsrey commented Feb 9, 2023

image

@sjsrey sjsrey added the WIP Work in progress. For discussion and feedback. Do not merge. label Feb 9, 2023
@sjsrey sjsrey requested a review from jGaboardi February 9, 2023 00:49
@jGaboardi
Member

Does distinctipy need to be added to the .ci/ environments?

@jGaboardi
Member

  • Looks like the env for 310 wasn't updated.
  • Once we get CI passing, we'll probably want to add a test (or tests) and add the notebook to the tutorials list.

@codecov

codecov bot commented Feb 9, 2023

Codecov Report

Merging #173 (f7476b1) into main (3c2bb92) will decrease coverage by 2.3%.
The diff coverage is 21.1%.

@@           Coverage Diff           @@
##            main    #173     +/-   ##
=======================================
- Coverage   88.5%   86.2%   -2.3%     
=======================================
  Files          8       8             
  Lines       1070    1108     +38     
=======================================
+ Hits         947     955      +8     
- Misses       123     153     +30     

Impacted Files               Coverage Δ
mapclassify/__init__.py      100.0% <ø> (ø)
mapclassify/classifiers.py   85.2% <21.1%> (-2.7%) ⬇️
mapclassify/greedy.py        92.1% <0.0%> (ø)

@martinfleis
Member

To be fair, I don't really see a need for it, especially in mapclassify. It doesn't do anything on the classification front and the only possible benefit over calling distinctipy directly in geopandas is a custom legend with counts. If you are interested only in a categorical plot with N distinct colours, this will do the trick.

gdf.plot('STATE_NAME', cmap=distinctipy.get_colormap(distinctipy.get_colors(gdf.STATE_NAME.nunique())))

If the main functionality you are interested in here is the plot method, then this should live in splot, not here.
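
For readers who want to see the idea behind the one-liner without installing anything, here is a minimal sketch of assigning one distinct colour per category. It rests on assumptions: `distinct_colors` and `color_by_category` are hypothetical helpers, and the golden-ratio hue stepping is not the algorithm distinctipy uses. It only illustrates the category-to-colour lookup that a categorical plot needs.

```python
import colorsys


def distinct_colors(n, saturation=0.65, value=0.9):
    """Return n RGB tuples whose hues are spread by a golden-ratio step.

    Stand-in for a real generator such as distinctipy; NOT its algorithm.
    """
    golden_step = 0.61803398875
    colors = []
    hue = 0.0
    for _ in range(n):
        colors.append(colorsys.hsv_to_rgb(hue, saturation, value))
        hue = (hue + golden_step) % 1.0
    return colors


def color_by_category(values):
    """Map each unique category to its own colour (hypothetical helper)."""
    categories = sorted(set(values))
    palette = distinct_colors(len(categories))
    return dict(zip(categories, palette))


# Toy column standing in for gdf.STATE_NAME.
states = ["Ohio", "Texas", "Ohio", "Utah", "Texas"]
lookup = color_by_category(states)

# One colour per row, ready to hand to a plotting call.
row_colors = [lookup[s] for s in states]
```

With distinctipy installed, the palette would instead come from `distinctipy.get_colors(n)`, as in the one-liner above.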

Member

@knaaptime knaaptime left a comment

this would be super useful for a lot of applications (e.g. our neighborhood delineation over in geosnap, where you can have a few hundred neighborhoods in a single metro but the mpl colormaps don't give enough variation)

I see your point Martin, but I'd just add that IMO the utility of geopandas using mapclassify under the hood is that I don't need to know or care about distinctipy as a user (and I definitely don't want to have to remember cmap=distinctipy.get_colormap(distinctipy.get_colors(gdf.STATE_NAME.nunique()))), even if that's sufficient. Since mapclassify has applications beyond geopandas, and this is a super useful classification method, it feels like an obvious enhancement to me

@martinfleis
Member

But you will not use this under the hood from geopandas, and if the main point is exposure of distinctipy, then it should be in splot. I just don't think this belongs in mapclassify, and it is not consistent with the rest of the package.

@knaaptime
Member

(well, I definitely would use this under the hood in geopandas lots of times, because I don't want to type that long string. That's like saying the quantiles class shouldn't be available because you can always do df.assign(col=othercol.quantile()).plot(col))

so i think this is the rub:

imo, the purpose of mapclassify is to create binning schemes that are appropriate for a wide variety of cartographic displays. In the case of unique values, it's true that you don't need to classify those values, but mapclassify still exists to provide an appropriate binning scheme for mapping those data (without requiring the cartographer to know additional libraries). And there are lots of cases (e.g. landuse classification) where these data and this kind of mapping are prevalent

the purpose of splot is spatial statistical visualization, which is why the esda plotting methods live there (and why that stuff isn't in mapclassify in the first place)

so I guess what I'm saying is it's far more natural for a 'unique binner' to live in the binning package, rather than in our version of seaborn

@martinfleis
Member

Alright, let me elaborate a bit as I think that my comments may have come across as a bit too harsh.

I think that this is a super useful feature to have when I need to plot categorical variables with more than the 20 classes supported natively by geopandas. And if exposed in geopandas, I would also use it myself. However, as implemented here, it is not compatible with being exposed in geopandas. That is what I meant. We use mapclassify under the hood via the scheme keyword, consuming primarily .bins and .yb to create a categorical variable that is then processed by standard categorical plotting. mapclassify.UniqueValue would then only give us the same information we already have natively in the GeoDataFrame, and the main point of plotting with N distinct colours would not get through. So passing scheme="unique_value" and categorical=True, or nothing in the case of a non-numerical column, will result in the same plot. The mapclassify.UniqueValue .plot will not be used there and I can't think of any reasonable way of doing so.
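
To make the point about .bins and .yb concrete, here is a minimal sketch with stand-in functions. `binned` and `unique_value` are hypothetical and are not mapclassify's API; they only assume that a classifier exposes class boundaries or labels plus a per-observation class id. The binning case produces new information, while the unique-value case round-trips back to the original column.

```python
from bisect import bisect_left


def binned(values, bins):
    """Stand-in binning classifier: `bins` are inclusive upper bounds.

    Returns (bins, yb), where yb[i] is the class id of values[i].
    """
    yb = [bisect_left(bins, v) for v in values]
    return bins, yb


def unique_value(values):
    """Stand-in unique-value classifier: one class per distinct label."""
    labels = sorted(set(values))
    code = {label: i for i, label in enumerate(labels)}
    yb = [code[v] for v in values]
    return labels, yb


# Continuous column: the class ids are new information for the plot.
income = [5, 12, 30, 18]
income_bins, income_yb = binned(income, [10, 20, 30])

# Categorical column: labels and ids only re-encode what is already there.
landuse = ["forest", "urban", "forest", "water"]
labels, landuse_yb = unique_value(landuse)
roundtrip = [labels[i] for i in landuse_yb]
```

Here `roundtrip` equals `landuse`, which is why a consumer that reads only the labels and class ids gains nothing over plotting the categorical column directly.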

If we want to use this from geopandas, then the reasonable thing would be to open an issue there resulting in a PR ensuring that you can pass something like cmap="distinct" to any categorical plot that would call distinctipy under the hood.

Now onto the second point. The point of mapclassify is to discretise a continuous variable into a set of classes. I am fine expanding that logic to categorical variables if we think it is useful in some way. But the output is always an array (bins, labels...), and that is consistent across the package. mapclassify.UniqueValue does that as well (though with questionable value) but on top of that implements something that the package does not have anywhere else: plotting. And it does that only to wrap distinctipy in a friendlier method. That is inconsistent, has no precedent here, and feels like an alien part of the codebase within mapclassify.

If there is any functionality in PySAL that is remotely close to this type of choropleth plotting it is splot.vba_choropleth. We can discuss if that belongs to splot given it is aimed at spatial statistical visualizations as you say but it is there, so there is a precedent.

We've been discussing the mess we have with plotting weights (one method in splot, other in libpysal) and that it should be consistently implemented in splot, so I don't want to create yet another place where we have some plotting code.

As a conclusion - if we, as a community, think that it would be useful to have a direct access to N-colored cmap when plotting a categorical variable from a GeoDataFrame, let's open an issue in geopandas and implement it there, where it would belong most naturally. If we also think that having the counts in the legend is important, it may also be included there. The same code can then be shared with the explore method to give it even better visibility.

If you all think that it is okay to implement it as is in mapclassify and that it is the best place for this functionality, I'll accept that. But at the moment I am just not convinced of that.

@knaaptime
Member

one of my fav parts of the dev process is having these discussions to make decisions by committee :D

@sjsrey
Member Author

sjsrey commented Feb 10, 2023

This discussion is what I hoped the WIP label would stimulate, so I think this is very productive.

The original motivation for this came from a user of mapclassify who asked for this ability. My PR is intended to show how this might be done. I am uncertain myself where this should actually live, and I can see merits in all the options that have been suggested thus far. A couple of thoughts:

Geopandas consumption

@martinfleis is correct that the current implementation cannot be exposed in geopandas as Unique_Value is not a subclass of MapClassifier. This was done because the classes for UV do not have bounds/intervals, just labels.

We could refactor this to extend the legend handling in mapclassify to deal with continuous and categorical variables in a more elegant fashion. If so, then exposing this in geopandas should be possible with the existing api.

Alternatively, we could do a PR into geopandas to add this functionality directly (i.e., it wouldn't be a classifier in mapclassify).

api inconsistency

Yes, since Unique_Value jettisons the inheritance in mapclassify it is not consistent with the other classifiers. The addition of the plot method also marks a departure. The latter was intended to flesh out the plotting issues/design more so than to suggest we add a plot method to all the classifiers in mapclassify (although this gets asked for from time to time).

For plotting code in pysal, I agree it is best to centralize that logic in splot. Other packages can consume that api, but the consistency should come through splot. I'm not against giving the different packages their own plot methods where it makes sense, as long as they are composed through splot to the extent possible to ensure consistency.

@knaaptime
Member

😁 @martinfleis I dont read you as harsh, we're both just direct writers with opinions

But the output is always an array (bins, labels...). And that is consistent across the package. mapclassify.UniqueValue does that as well (though with a questionable value) but on top of that implements something that the package does not have anywhere else - plotting. And it does that only to wrap distinctipy into a more friendly method. Which is inconsistent, it has no precedent in here and feels like an alien part of the codebase within mapclassify.

my view is this is a philosophical distinction. Packages evolve over time and this function is designed to help make good looking maps by putting data into bins--which is precisely the purpose of mapclassify

I guess the question is how you view the categorization of functionality across the packages. I'd argue the conceptual difference between the packages is more important. The "precedent" here is providing simple tools for creating good looking maps. It doesn't matter how the code works. Implicitly, mapclassify is about creating choropleths, regardless of whether we've done the actual plotting in the past. In this case, it makes sense to go ahead and implement the plotting because that's the best way to surface the functionality we're actually after with the package--making attractive maps easily

although the value-by-alpha stuff exists in splot, the package itself is not about choropleths. It's about wrapping tailored visualizations around spatial analyses. So, personally, I see no precedent for this function over there

if this function is a crayon, then it makes more sense to me inside the box of pencils (mapclassify) than the box of protractors (splot)

@martinfleis
Member

We could refactor this to extend the legend handling in mapclassify to deal with continuous and categorical variables in a more elegant fashion. If so, then exposing this in geopandas should be possible with the existing api.

That is only partially true. We would be able to pass labels to geopandas but not colours. And passing labels is a bit pointless given geopandas can do that itself and will call pd.Categorical on those labels anyway.

Alternatively, we could do a PR into geopandas to add this functionality directly (i.e., it wouldn't be a classifier in mapclassify).

Would any of you object to this option? It is imho the best one. I would add "distinct" as a special case for the cmap argument to expose the distinctipy colormap, plus additional options in legend_kwds controlling whether to show counts in the legend (opt-in) and whether to sort it.
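
As a rough sketch of what this proposal could look like, the snippet below special-cases "distinct" and builds legend labels with opt-in counts and optional sorting. Everything here is an assumption for discussion: `resolve_cmap`, `legend_labels`, `show_counts`, `sort`, and `placeholder_colors` are hypothetical names, not existing geopandas or distinctipy API, and the grey ramp merely stands in for a real colour generator.

```python
from collections import Counter


def placeholder_colors(n):
    """Placeholder palette (grey ramp); a real version would call distinctipy."""
    return [(i / max(n - 1, 1),) * 3 for i in range(n)]


def resolve_cmap(cmap, n_categories, make_colors=placeholder_colors):
    """Hypothetical: expand the special value "distinct" into n colours.

    Any other value is passed through untouched for normal handling.
    """
    if cmap == "distinct":
        return make_colors(n_categories)
    return cmap


def legend_labels(values, show_counts=False, sort=True):
    """Hypothetical legend builder with opt-in counts and optional sorting."""
    counts = Counter(values)
    # Unsorted order is the order of first appearance in the column.
    categories = sorted(counts) if sort else list(counts)
    if show_counts:
        return [f"{c} ({counts[c]})" for c in categories]
    return [str(c) for c in categories]


landuse = ["urban", "forest", "urban", "water", "forest", "urban"]
palette = resolve_cmap("distinct", len(set(landuse)))
passthrough = resolve_cmap("viridis", len(set(landuse)))
with_counts = legend_labels(landuse, show_counts=True)
unsorted_plain = legend_labels(landuse, sort=False)
```

Keeping the counts opt-in means existing categorical legends would stay unchanged unless a caller asks for them.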

@sjsrey
Member Author

sjsrey commented Feb 10, 2023

Would any of you object to this option? It is imho the best one. I would add "distinct" as a special case for the cmap argument to expose the distinctipy colormap, plus additional options in legend_kwds controlling whether to show counts in the legend (opt-in) and whether to sort it.

I'm leaning this way, as I think it is the cleanest solution.

I can think of a couple of additional options that might be useful, but I could add them to the PR in GP if this is the way we go.

@martinfleis
Member

I would start with an issue outlining the idea in the geopandas repo to gather feedback from folks there. We may hit resistance (I don't think we will) and circle back here.

Labels
WIP Work in progress. For discussion and feedback. Do not merge.

4 participants