Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NaN attribute values cause weird behaviour with Local Moran's I and Local Geary #215

Open
waeiski opened this issue Aug 11, 2022 · 4 comments

Comments

@waeiski
Copy link

waeiski commented Aug 11, 2022

I am not sure whether what I've encountered is a bug or expected behaviour with data like mine.

I am trying to perform a Local Moran's I analysis on a population variable in a vector file that consists of 250 meter grids. The output gives every cell a pseudo p-value of 0.0001, a classification 3 (or 'LL'), and the Moran's I value for every grid cell are NaN. The grid vector file does not contain any islands, but a majority of the cells have NaN values as not all cells are populated. The spatial weights are constructed without errors and this behaviour happens regardless of the type of spatial weights (Queen, Rook and KNN). Same happens if I switch to local Getis-Ord, the G-values are all NaN. I think that esda's handling of NaN/None values might need to be checked. I could not find any mention on how I am supposed to deal with vector files with many NaNs in the attribute fields in the PySAL tutorials, or libpysal/esda documentation, so I am not 100% sure whether this is a bug or expected behaviour with this type of data.

For what it is worth, GeoDA does not produce the same weird result with the exact same data. GeoDA classifies cells with NaN values as "Undefined", so the problem shouldn't be with my data.

Last but not least, I am running PySAL 2.3.0 with esda 2.3.1. The older versions are because, I work in a remote "secure computing environment" so updating anything costs money. However, I am not sure whether this behaviour is the same that was solved in: #196. If this behaviour has been solved in that PR or elsewhere, I guess I have to dish out some project funding.

@ljwolf
Copy link
Member

ljwolf commented Aug 11, 2022

Hey, thanks for the report!

In the past, we had agreed as a library to not handle missing values. The solution you link to (#196) would not "solve" the problem you're raising, since it would simply error by saying you have NaN/None values. So, I don't think upgrading will do what you want.

The "best" solution would probably be for us to set all "missing value" rows of the weights matrix to Zero, so that they never enter into any of the computations. I'm not sure if that would have ramifications down the line at the moment, but you can try that.

A different easy workaround for you would be to replace any missing values with their lagged value, and then mask them out again afterwards. So,

from pysal.lib import weights
import numpy
import esda

y_smoothed = numpy.copy(y)
nan_values = numpy.isnan(y)
y_smoothed[nan_values] = weights.lag_spatial(W, y)[nan_values]

lmo = esda.Moran_Local(y_smoothed, W)

Is = numpy.copy(lmo.Is)
Is[nan_values] = numpy.nan
p_sim = numpy.copy(lmo.p_sim)
p_sim[nan_values] = numpy.nan
q = numpy.copy(lmo.q)
q = q.astype(str)
q[nan_values] = "Undefined"

@darribas
Copy link
Member

To add to Levi's suggestion, a couple of thoughts:

  1. You say some cells are "unpopulated". Does that mean the value is actually 0? If so you could fill that in and run the LISA on the fully dense dataset?
  2. If those are not zero, it might be more efficient to remove them from the data before computation. Two ways:

a. Remove them before building your weights

# Using Levi's notation and objects where possible and assuming the geographies and y are part of 
# a GeoDataFrame called `db`
w_sub = weights.Queen.from_dataframe(db.loc[~nan_values, :])

lmo = esda.Moran_Local(y[~nan_values], w_sub)

b. Subset w after creation

w_sub = weights.w_subset(w, y[~nan_values].index)

lmo = esda.Moran_Local(y[~nan_values], w_sub)

The above assumes you built w from a table indexed as y (e.g., if y was a column in db).

These two approaches are equivalent to computing LISA on a geography with "holes", which is a slightly different assumption than you would by smoothing as Levi suggests. Both are second bests, but in slightly different way. Which one is preferrable I'd say depends on your use case?

@waeiski
Copy link
Author

waeiski commented Aug 12, 2022

Thanks @darribas @ljwolf for the suggestions and clarifications. I will try setting to zero, masking and removing and see how they differ in terms of results. If it helps the development or documentation work of PySAL, I can report back the differences here.

As for Dani's questions: You are correct. The cells with NaNs are unpopulated so 0 people live in these cells (e.g. industrial areas, forests, lakes, sea etc.). I'll see whether populating the NaNs with 0's causes population e.g. "cold spots" to occur in these unpopulated areas, which in a sense is correct, but it is not what I am after. I would like the cold spots to indicate cells that have few inhabitants instead of none at all.

@darribas
Copy link
Member

I would like the cold spots to indicate cells that have few inhabitants instead of none at all.

I'd then probably go for subsetting to only those you know have population as that is your "universe". Filling them with zeros or smoothing as Levi suggests would implicitly set the analysis to all areas, including those with no population. If that is a large group of cells, I imagine all you'd get is a map of empty areas and areas with someone on them.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants