ENH: Graph IO to classic weights file formats #698

martinfleis · 2024-04-06T21:52:00Z

WIP and not very well tested (in the sense that I am not certain it is always 1:1 with the weights implementation).

So far, GAL. I am also planning to look at GWT. Is there anything else that is commonly used?

codecov · 2024-04-06T21:57:48Z

Codecov Report

Attention: Patch coverage is 98.80952%, with 1 line in your changes missing coverage. Please review.

Project coverage is 85.0%. Comparing base (bcabdbc) to head (f014971).
Report is 15 commits behind head on main.

Additional details and impacted files

@@           Coverage Diff           @@
##            main    #698     +/-   ##
=======================================
- Coverage   85.0%   85.0%   -0.0%     
=======================================
  Files        141     145      +4     
  Lines      15203   15361    +158     
=======================================
+ Hits       12924   13055    +131     
- Misses      2279    2306     +27

Files	Coverage Δ
libpysal/__init__.py	`100.0% <100.0%> (ø)`
libpysal/graph/__init__.py	`100.0% <100.0%> (ø)`
libpysal/graph/_contiguity.py	`98.9% <100.0%> (+<0.1%)`	⬆️
libpysal/graph/_utils.py	`95.0% <100.0%> (+<0.1%)`	⬆️
libpysal/graph/base.py	`97.0% <100.0%> (-1.0%)`	⬇️
libpysal/graph/io/_gwt.py	`100.0% <100.0%> (ø)`
libpysal/graph/io/_parquet.py	`84.0% <ø> (ø)`
libpysal/graph/tests/test_base.py	`100.0% <100.0%> (ø)`
libpysal/graph/io/_gal.py	`96.2% <96.2%> (ø)`

... and 2 files with indirect coverage changes

martinfleis · 2024-04-08T19:50:59Z

Can someone with a bit of historical knowledge (@serge, @levi?) help me understand the treatment of headers here? GeoDa's User Guide from 2003 states for GAL that

When a Key Variable is specified, that line contains four values: 0 (reserved for future use), the number of observations (100), the name of the shape file (SIDS) and the variable name for the Key Variable (FIPSNO). When sequence numbers are used to label the observations, the header line only contains the number of observations

Our existing GAL writer puts only the number of observations in the header, yet it writes the actual IDs of observations rather than sequence numbers (which I interpret as the positional index, i.e. iloc).

With GWT, the definition is practically the same:

When a Key Variable has been specified, the header line is as in Figure 126, for k-nearest neighbors of order 4 in the Columbus data set. It contains four items: 0 (for future use), the number of observations (49), the name of the shape file (COLUMBUS) and the Key Variable (POLYID). When no Key Variable is specified, but sequence numbers are used, the header line consists only of the number of observations.

But our GWT implementation does not write just the number of observations, as in the GAL case; it writes `0 n_obs Unknown Unknown`, and again uses the actual indices.

Given there is apparently no other documentation of these file formats, what are the correct headers?

If we assume that a header consisting only of the number of observations means the IDs are positional indices, then what GAL is currently doing is wrong and we should do what GWT does in both. Though it makes little sense to write out that we don't know something.

Any clue what the header should look like for maximum compatibility? Does anyone have spdep ready to check what they're doing?
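
For readers following along, the two header variants quoted above from the GeoDa guide can be sketched in plain Python. This is an illustration only, not libpysal's writer: `format_header` and `parse_header` are hypothetical helper names, and filling a missing name with `Unknown` simply mirrors what the existing GWT output is described as doing.

```python
def format_header(n_obs, shapefile=None, key_variable=None):
    """Build a GAL/GWT header line (hypothetical helper, not libpysal API)."""
    if shapefile is None and key_variable is None:
        # Sequence-number variant: the header is only the number of observations.
        return str(n_obs)
    # Key-variable variant: 0 (reserved), n_obs, shapefile name, key variable.
    # Missing names fall back to "Unknown" (assumption for this sketch).
    return f"0 {n_obs} {shapefile or 'Unknown'} {key_variable or 'Unknown'}"


def parse_header(line):
    """Parse either header variant into a small dict."""
    parts = line.split()
    if len(parts) == 1:
        return {"n_obs": int(parts[0]), "shapefile": None, "key_variable": None}
    if len(parts) == 4:
        return {
            "n_obs": int(parts[1]),
            "shapefile": parts[2],
            "key_variable": parts[3],
        }
    raise ValueError(f"Unrecognised header: {line!r}")


print(format_header(100))                    # sequence-number variant
print(format_header(100, "SIDS", "FIPSNO"))  # key-variable variant
print(parse_header("0 49 COLUMBUS POLYID"))
```

A reader that accepts both one-item and four-item headers, as `parse_header` does here, sidesteps the compatibility question on the reading side; the open question is only which variant to write.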

sjsrey · 2024-04-09T13:56:51Z

Here is how spdep reads gwt files and gal files.

martinfleis · 2024-05-20T19:35:45Z

This should be ready for review now. Interestingly, it has uncovered a bug in our conversion from dicts to arrays (and adjacency), where the tooling was not able to process self-weights of 1 and always treated focal == neighbor as an isolate, giving it a weight of 0. That should be fixed now.
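
To make the bug concrete, here is a minimal sketch of a dict-to-adjacency conversion that keeps genuine self-weights apart from isolates. It is not the libpysal code: `dicts_to_adjacency` is a hypothetical name, and encoding an isolate as a self-loop with weight 0 is an assumption taken from the description above.

```python
def dicts_to_adjacency(neighbors, weights):
    """Flatten neighbor/weight dicts into (focal, neighbor, weight) rows.

    Hypothetical helper for illustration; not part of libpysal.
    """
    rows = []
    for focal, neighs in neighbors.items():
        if not neighs:
            # A true isolate has no neighbors at all: encode it as a
            # self-loop with weight 0 (assumed convention for this sketch).
            rows.append((focal, focal, 0.0))
            continue
        for neighbor, weight in zip(neighs, weights[focal]):
            # focal == neighbor is allowed here and keeps its own weight,
            # instead of being mistaken for an isolate and zeroed out.
            rows.append((focal, neighbor, float(weight)))
    return rows


neighbors = {"a": ["a", "b"], "b": ["a"], "c": []}
weights = {"a": [1, 0.5], "b": [0.5], "c": []}
adjacency = dicts_to_adjacency(neighbors, weights)
for row in adjacency:
    print(row)
```

The point of the sketch is the branch order: the isolate check looks at whether the neighbor list is empty, not at whether focal equals neighbor, so a self-weight of 1 survives the conversion.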

martinfleis · 2024-05-20T19:37:32Z

Also, regarding my questions above... it seems that spdep allows only integer (positional) IDs, and given there is no documentation of either of those file formats whatsoever, I tried to ensure that the graph IO matches the output of the weights IO, so we are consistent with ourselves.

libpysal/graph/_contiguity.py

ljwolf · 2024-05-27T12:46:21Z

looks fine to me! good catch on the self-weight.

We need to be consistent about that, since esda will require an overhaul once it's done. Those statistics, especially the local ones, ignore self-weight effects.

GAL Graph IO

31f3b20

martinfleis added the graph label Apr 6, 2024

GWT

fa74d6e

martinfleis mentioned this pull request Apr 8, 2024

Formalisation of the Parquet spec of weights IO #699

Open

martinfleis added 2 commits May 20, 2024 21:31

tests + necessary bugfixes

e8db775

allow full path imports

402ece5

martinfleis marked this pull request as ready for review May 20, 2024 19:34

martinfleis added 3 commits May 20, 2024 21:53

compat

e9d5da4

Merge remote-tracking branch 'upstream/main' into graph-io

8fb1cbc

pull the latest tobler

f014971

martinfleis added the enhancement label May 20, 2024

ljwolf approved these changes May 27, 2024

cleanup

ef6bfdc

martinfleis merged commit 25bb6a1 into pysal:main May 27, 2024
9 checks passed
