Add documentation on the effects of duplicates in the source geometries #182

Open
darribas opened this issue Aug 15, 2023 · 1 comment
Labels
documentation Improvements or additions to documentation

Comments

@darribas
Member

I don't think this is necessarily a bug, but it is something that caught me off guard until I thought it through, and could trip up other users, so maybe the solution is adding a bit of documentation.

In areal interpolation (not sure about other cases), if the source geometries have duplicates or overlaps, the results are wrong. At least for categoricals (I'm not sure what would happen to intensive/extensive, but I think something similar), some percentages add up to more than 1. My sense is this comes from more than one source geometry covering the same patch of land, which then causes it to be counted more than once. Again, this is what the method would do and, arguably, a strange case (it's unusual to have overlapping/duplicate source geometries), but maybe worth adding a line on the source_df documentation?

source_df : geopandas.GeoDataFrame

What do you think?
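To make the double counting concrete, here is a minimal pure-Python sketch. It is not tobler's implementation: it assumes a categorical share is the intersected area of each category divided by the target area, and it uses hypothetical axis-aligned boxes `(xmin, ymin, xmax, ymax)` in place of real geometries.

```python
def overlap_area(a, b):
    """Intersection area of two axis-aligned boxes (xmin, ymin, xmax, ymax)."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(width, 0.0) * max(height, 0.0)


def categorical_shares(sources, target):
    """Share of the target covered by each category (assumed definition)."""
    target_area = (target[2] - target[0]) * (target[3] - target[1])
    shares = {}
    for box, category in sources:
        share = overlap_area(box, target) / target_area
        shares[category] = shares.get(category, 0.0) + share
    return shares


target = (0.0, 0.0, 1.0, 1.0)

# Planar sources: two halves that tile the target without overlapping.
planar = [((0.0, 0.0, 0.5, 1.0), "urban"), ((0.5, 0.0, 1.0, 1.0), "rural")]

# Overlapping sources: "urban" covers the whole target, "rural" the right half.
overlapping = [((0.0, 0.0, 1.0, 1.0), "urban"), ((0.5, 0.0, 1.0, 1.0), "rural")]

planar_shares = categorical_shares(planar, target)
overlapping_shares = categorical_shares(overlapping, target)

print(sum(planar_shares.values()))       # shares of the planar case sum to 1.0
print(sum(overlapping_shares.values()))  # the right half is counted twice: 1.5
```

The right half of the target is covered by both sources, so its area enters the total twice and the shares sum to 1.5 instead of 1.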

@darribas darribas added the documentation Improvements or additions to documentation label Aug 15, 2023
@knaaptime
Member

knaaptime commented Aug 21, 2023

In areal interpolation (not sure about other cases), if the source geometries have duplicates or overlaps, the results are wrong.

Not quite. The validity depends on the question. If you've got data on, say, overlapping school districts (some private, some public) and you're sending average test scores to a smaller geometry, then the target geometry gets the area-weighted average of the overlapping polys that cover it (which is what you want in this case). If that small poly is covered entirely by two different overlapping districts, one private and one public, then the target gets 50/50 shares.
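A rough sketch of that 50/50 case, assuming the intensive weights are each source's intersected area divided by the total intersected area for the target. The boxes and test scores are made up for illustration, and this is plain Python rather than the library's code.

```python
def overlap_area(a, b):
    """Intersection area of two axis-aligned boxes (xmin, ymin, xmax, ymax)."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(width, 0.0) * max(height, 0.0)


def intensive_estimate(sources, target):
    """Area-weighted mean of the source values over the target (assumed rule)."""
    areas = [overlap_area(box, target) for box, _ in sources]
    total = sum(areas)
    return sum(area / total * value for area, (_, value) in zip(areas, sources))


target = (0.0, 0.0, 1.0, 1.0)

# Two hypothetical districts that overlap each other and both fully cover the target.
public_district = ((0.0, 0.0, 2.0, 2.0), 80.0)     # average score 80
private_district = ((-1.0, -1.0, 1.0, 1.0), 60.0)  # average score 60

estimate = intensive_estimate([public_district, private_district], target)
print(estimate)  # each district gets weight 0.5, so the estimate is 70.0
```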

If you've got an extensive variable with overlapping sources (and those overlaps are conceptually valid in the source data), then the overlapping sum is correct.
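The same idea for an extensive variable, again as a hedged sketch: it assumes each source sends the fraction of its value equal to the fraction of its own area that falls in the target. The pupil counts and boxes are hypothetical. The arithmetic is identical whether the overlap is meaningful or an accidental duplicate, which is why only the user can tell the two apart.

```python
def overlap_area(a, b):
    """Intersection area of two axis-aligned boxes (xmin, ymin, xmax, ymax)."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(width, 0.0) * max(height, 0.0)


def box_area(box):
    """Area of an axis-aligned box (xmin, ymin, xmax, ymax)."""
    return (box[2] - box[0]) * (box[3] - box[1])


def extensive_estimate(sources, target):
    """Sum of source values, each scaled by the share of the source inside the target."""
    return sum(
        value * overlap_area(box, target) / box_area(box) for box, value in sources
    )


target = (0.0, 0.0, 1.0, 1.0)
public = ((0.0, 0.0, 1.0, 1.0), 100.0)  # 100 pupils in the public district
private = ((0.0, 0.0, 1.0, 1.0), 40.0)  # 40 pupils in the private district

# Conceptually valid overlap: two different populations on the same land.
valid_overlap = extensive_estimate([public, private], target)

# Accidental duplicate: the same record appears twice and is counted twice.
accidental_duplicate = extensive_estimate([public, public], target)

print(valid_overlap)         # 140.0, the correct total
print(accidental_duplicate)  # 200.0, double the true 100
```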

Non-planar geometries are something that can obviously surface a lot in interpolation problems, so I've thought a few times about including some sort of check, but ultimately non-planarity is also a basic data check and something the user needs to understand about their data, so I've landed on the idea that folks should use https://github.com/sjsrey/geoplanar when they need to check their data.
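For a sense of what such a check involves, here is a toy pairwise overlap test on hypothetical axis-aligned boxes. It is only an illustration of the idea and does not use or reproduce geoplanar's API; real data would need proper polygon intersection and a spatial index.

```python
from itertools import combinations


def overlap_area(a, b):
    """Intersection area of two axis-aligned boxes (xmin, ymin, xmax, ymax)."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(width, 0.0) * max(height, 0.0)


def overlapping_pairs(boxes):
    """Index pairs of boxes that share positive area (touching edges do not count)."""
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(boxes), 2)
        if overlap_area(a, b) > 0
    ]


sources = [
    (0.0, 0.0, 1.0, 1.0),  # box 0
    (1.0, 0.0, 2.0, 1.0),  # box 1 shares only an edge with box 0
    (0.5, 0.0, 1.5, 1.0),  # box 2 straddles boxes 0 and 1
]

pairs = overlapping_pairs(sources)
print(pairs)  # [(0, 2), (1, 2)]
```

An empty result would suggest the sources are planar in this simplified sense; any pair returned is worth inspecting before interpolating.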
