Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Graph IO to classic weights file formats #698

Merged
merged 8 commits into from
May 27, 2024

Conversation

martinfleis
Copy link
Member

WIP and not very well tested (in a sense that I am not certain it is always 1:1 with weights implementation).

So far, GAL. I am also planning to look at GWT. Is there anything else that is commonly used?

Copy link

codecov bot commented Apr 6, 2024

Codecov Report

Attention: Patch coverage is 98.80952% with 1 lines in your changes are missing coverage. Please review.

Project coverage is 85.0%. Comparing base (bcabdbc) to head (f014971).
Report is 15 commits behind head on main.

Additional details and impacted files

Impacted file tree graph

@@           Coverage Diff           @@
##            main    #698     +/-   ##
=======================================
- Coverage   85.0%   85.0%   -0.0%     
=======================================
  Files        141     145      +4     
  Lines      15203   15361    +158     
=======================================
+ Hits       12924   13055    +131     
- Misses      2279    2306     +27     
Files Coverage Δ
libpysal/__init__.py 100.0% <100.0%> (ø)
libpysal/graph/__init__.py 100.0% <100.0%> (ø)
libpysal/graph/_contiguity.py 98.9% <100.0%> (+<0.1%) ⬆️
libpysal/graph/_utils.py 95.0% <100.0%> (+<0.1%) ⬆️
libpysal/graph/base.py 97.0% <100.0%> (-1.0%) ⬇️
libpysal/graph/io/_gwt.py 100.0% <100.0%> (ø)
libpysal/graph/io/_parquet.py 84.0% <ø> (ø)
libpysal/graph/tests/test_base.py 100.0% <100.0%> (ø)
libpysal/graph/io/_gal.py 96.2% <96.2%> (ø)

... and 2 files with indirect coverage changes

@martinfleis
Copy link
Member Author

Can someone with a bit of historical knowledge (@serge, @levi?) help me understand the treatment of headers here? GeoDa's User Guide from 2003 states for GAL that

When a Key Variable is specified, that line contains four values: 0 (reserved for future use), the number of observations (100), the name of the shape file (SIDS) and the variable name for the Key Variable (FIPSNO). When sequence numbers are used to label the observations, the header line only contains the number of observations

Our existing GAL writing uses only the number of observations in the header, but the actual IDs of observations, not the sequence number (which I translate that positional index (iloc)).

With GWT, the definition is practically the same:

When a Key Variable has been specified, the header line is as in Figure 126, for k-nearest neighbors of order 4 in the Columbus data set. It contains four items: 0 (for future use), the number of observations (49), the name of the shape file (COLUMBUS) and the Key Variable (POLYID). When no Key Variable is specified, but sequence numbers are used, the header line consists only of the number of observations.

But our implementation does not use just the number of observations like in the GAL case but 0 n_obs Unknown Unknown. And again the actual indices.

Given there is apparently no other documentation of these file formats, what are the correct headers?

If we should assume that with header consisting of a number of observations only, the IDs are positional indices, than what GAL is currently doing is wrong and we should do what GWT is doing in both. Though it makes a little sense to write that we don't know something.

Any clue how the header should look like for maximum compatibility? Anyone has spdep ready to check what they're doing?

@sjsrey
Copy link
Member

sjsrey commented Apr 9, 2024

Can someone with a bit of historical knowledge (@serge, @levi?) help me understand the treatment of headers here? GeoDa's User Guide from 2003 states for GAL that

When a Key Variable is specified, that line contains four values: 0 (reserved for future use), the number of observations (100), the name of the shape file (SIDS) and the variable name for the Key Variable (FIPSNO). When sequence numbers are used to label the observations, the header line only contains the number of observations

Our existing GAL writing uses only the number of observations in the header, but the actual IDs of observations, not the sequence number (which I translate that positional index (iloc)).

With GWT, the definition is practically the same:

When a Key Variable has been specified, the header line is as in Figure 126, for k-nearest neighbors of order 4 in the Columbus data set. It contains four items: 0 (for future use), the number of observations (49), the name of the shape file (COLUMBUS) and the Key Variable (POLYID). When no Key Variable is specified, but sequence numbers are used, the header line consists only of the number of observations.

But our implementation does not use just the number of observations like in the GAL case but 0 n_obs Unknown Unknown. And again the actual indices.

Given there is apparently no other documentation of these file formats, what are the correct headers?

If we should assume that with header consisting of a number of observations only, the IDs are positional indices, than what GAL is currently doing is wrong and we should do what GWT is doing in both. Though it makes a little sense to write that we don't know something.

Any clue how the header should look like for maximum compatibility? Anyone has spdep ready to check what they're doing?

Here is how spdep reads gwt files and gal files.

@martinfleis martinfleis marked this pull request as ready for review May 20, 2024 19:34
@martinfleis
Copy link
Member Author

This should be ready for review now. Interestingly, it has uncovered a bug in our conversion from dicts to arrays (and adjacency), where the tooling was not able to process self-weights of 1 and always considered focal == neighbor as an isolate, giving it 0. That should be fixed now.

@martinfleis
Copy link
Member Author

Also, regarding my questions above... it seems that spdep allows only integer IDs (positional) and given there is no documentation of either of those file format whatsoever, I tried to ensure that the graph IO matches the output of weights IO, so we are consistent with ourselves.

libpysal/graph/_contiguity.py Outdated Show resolved Hide resolved
@ljwolf
Copy link
Member

ljwolf commented May 27, 2024

looks fine to me! good catch on the self-weight.

We need to be consistent about that, since esda will require an overhaul once it's done. Those statistics, especially the local ones, ignore self-weight effects.

@martinfleis martinfleis merged commit 25bb6a1 into pysal:main May 27, 2024
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants