Weights sprint planning #534

Open
ljwolf opened this issue Jul 18, 2023 · 58 comments
Comments

@ljwolf
Member

ljwolf commented Jul 18, 2023

I'm starting this thread to begin discussions about the upcoming weights sprint. The current experimental implementation is in the geographs branch, the weights/experimental module. The ROADMAP.md will need updating...

@martinfleis
Member

Things we have right now:

  • Base class
  • Generic constructors (from_arrays, from_sparse, from_old_w (for ease of testing), from_weights_dict, from_dicts), all channeled through from_arrays.
  • Contiguity constructors (from_contiguity)
  • A bunch of properties
  • Transformation
  • Handling of islands
  • basic tooling for kernel constructors
  • unfinished tooling for triangulation constructors
  • some set operations, but @ljwolf will know the state of those better
  • spatial lag

What needs to be done:

  • hook kernel constructors to the base class
  • finish triangulation and hook to the base class
  • tests
    • tests of the functionality independent of current codebase
    • tests against the current implementations. These will be gone at some point, but we need to ensure that the results from the old and new code match
  • I/O
  • plotting
  • lag_categorical
  • finish set operations
  • fuzzy_contiguity
  • a bunch of other utils we have now

Where we're blocked by scipy.sparse development

  • components
  • higher_order

My suggestion would be to focus on testing right now. The codebase is getting large, and it should be thoroughly covered before we move forward.

@martinfleis
Member

I also think that we should start making PRs into the geographs branch rather than committing directly, to ensure proper review.

@ljwolf
Member Author

ljwolf commented Jul 18, 2023

I will try to finish scipy/scipy#18544, which should unblock higher_order. We will also need to identify why connected components fails on the csgraph side, but that should be doable tomorrow.

If necessary, we can build on top of the matrix classes for now and migrate once the arrays are feature complete?

@martinfleis
Member

csgraph does seem to generally require matrix for now.

If necessary, we can build on top of the matrix classes for now and migrate once the arrays are feature complete?

Maybe as on-the-fly conversion when necessary?
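For illustration, the on-the-fly conversion could be as small as wrapping the array in the spmatrix interface right before the csgraph call. A minimal sketch with toy data (not the branch's code):

import numpy as np
from scipy import sparse
from scipy.sparse import csgraph

# toy adjacency held as the newer sparse *array* type
adj = sparse.csr_array(
    np.array([[0, 1, 0],
              [1, 0, 0],
              [0, 0, 0]])
)

# convert on the fly to the spmatrix interface that csgraph has historically expected
n_components, labels = csgraph.connected_components(sparse.csr_matrix(adj), directed=False)
print(n_components, labels)  # 2 components: {0, 1} and the isolate {2}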

@ljwolf
Member Author

ljwolf commented Jul 20, 2023

We are currently not sure whether an "adaptive" bandwidth parameter is allowed in the kernel() function.

@ljwolf
Member Author

ljwolf commented Jul 20, 2023

The kernel() function may also need to implement a "threshold" option that keeps only pairs of points closer than the threshold.
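For context, one way such a threshold could behave, sketched outside the actual kernel() implementation (the function name and defaults here are illustrative only, not the module's API):

import numpy as np
from scipy.spatial import distance_matrix

def truncated_gaussian_kernel(coords, threshold, bandwidth=1.0):
    """Illustrative sketch: keep only pairs closer than `threshold`, then kernel-weight them."""
    d = distance_matrix(coords, coords)                        # dense pairwise distances
    keep = (d < threshold) & ~np.eye(len(coords), dtype=bool)  # drop far pairs and self-pairs
    w = np.exp(-((d / bandwidth) ** 2) / 2)                    # gaussian kernel on distance
    return np.where(keep, w, 0.0)

pts = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
print(truncated_gaussian_kernel(pts, threshold=2.0))           # the far point gets no neighbors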

@ljwolf
Member Author

ljwolf commented Jul 20, 2023

Not going to be able to complete scipy/scipy#18544 today, but will aim to next sprint.

@knaaptime
Member

knaaptime commented Jul 20, 2023

TLDR:

  • what's the intended signature for instantiating a kernel-based W from a dataframe?
  • what's the signature for instantiating a kernel-based W from a precomputed distance matrix?

I'm trying to wrap my head around the idea of input data structures versus desired output as first-class citizens for constructing $W$s

so, after our call, I think the idea is to always treat the input data structure as primary. Right now, we have a from_contiguity constructor, but I'm sure I heard Levi say we won't have such a thing in the new structure (and, analogously, we won't have from_kernel). So, while we need to "hook kernel constructors to the base class", that doesn't mean creating a from_kernel class method that's analogous to from_contiguity.

Instead, the idea is that from_contiguity will become private, and we will create a larger class method that's from_dataframe? So, currently, the kernel functions take an argument called coordinates which gets passed to a geometry validator. That means the input must be a geodataframe, not, say, a distance matrix. So even though we want to allow distance=precomputed, that won't work because L109 is forcing an array of geoms.

So if we want to create a kernel-based W, and we are treating the input data structure as first class, then I need to do something like

W.from_dataframe(**kernel_args?), because we want to avoid W.from_kernel(**input_data_args?)? If that's the case, then the primary purpose of W.from_dataframe() is to dispatch to helper functions that generate the correct data structure for different types of weights? Then it will end up subsuming from_contiguity, and to create a queen weight I'd do something like W.from_dataframe(gdf, 'contiguity', queen=True)?; conversely, if I needed a kernel from precomputed distances, I'd do something like W.from_csc(distance_matrix, 'kernel', kernel_function='gaussian')?; whereas if I needed it built on the fly with euclidean distances, I'd do W.from_dataframe(gdf, 'kernel', kernel_function='gaussian'). This all applies to triangulation as well.

The alternative would be to leave desired output as first class, which would mean leaving from_contiguity intact, then adding analogous from_kernel, from_triangulation, etc

@ljwolf
Member Author

ljwolf commented Jul 20, 2023

I did not recall that we had already implemented some of them; that's my misunderstanding. Sorry!

If from_contiguity is already there, run with that and add them for now.

From a design angle, I feel like "from_kernel()" and the like would be confusing. To me, that implies "kernel" is an input data format, not the structure of the output of the method. "to_kernel()" would imply that, but I think then "kernel" should probably be a type? I hoped to avoid typing like this in favor of the construction functions outputting plain W objects, since we get into tricky inheritance consistency issues (like, is a kernel-weighted voronoi an instance of rook, kernel, Delaunay, or all three?). But we may need to chat about this next sprint?

@knaaptime
Member

knaaptime commented Jul 20, 2023

From a design angle, I feel like "from_kernel()" and the like would be confusing. To me, that implies "kernel" is an input data format, not the structure of the output of the method. "to_kernel()" would imply that, but I think then "kernel" should probably be a type?

yeah, completely agree. You don't do from_kernel, because your input data isn't "a kernel".

so the issue is: we want a generic W class, and it should be instantiated explicitly from different data structures. But there are also genuinely different subtypes of W, as they mean different things and require different computations to create them. Before, we had subclasses for each type of W, and they inherited the same constructors for different input formats. I agree we want to avoid weird inheritance, and subclassing W more generally, but then the only way I can think to handle the combination of w_type * data_structure is to slightly overload the arguments for data-structure-based class methods, like

W.from_dataframe(gdf, 'contiguity', queen=True)

that gets messy because it needs to accept all the kwargs for constructing all subtypes of W from that data structure, but it's conceptually straightforward (and the converse of our previous design, where the W subtype was first-class and inherited constructors for the shared input data types)

@knaaptime
Member

or do we take a generic w from input W.from_dataframe(gdf)

and define a conversion to_kernel, to_voronoi, etc?

@ljwolf
Member Author

ljwolf commented Jul 20, 2023 via email

@sjsrey
Member

sjsrey commented Jul 20, 2023

@ljwolf question on the signature of delaunay

Should the first arg be an array or a geoseries/gdf?

def delaunay(coordinates, ids=None, bandwidth=numpy.inf, kernel="boxcar"):
    """
    Constructor of the Delaunay graph of a set of input points.

    Relies on scipy.spatial.Delaunay and numba to quickly construct
    a graph from the input set of points. Will be slower without numba,
    and will warn if this is missing.

    Parameters
    ----------
    coordinates : array of points, (N,2)
        numpy array of coordinates containing locations to compute the
        delaunay triangulation

as later on geoms is passed for validation:

    coordinates, ids, geoms = _validate_geometry_input(
        geoms, ids=ids, valid_geom_types=_VALID_GEOMETRY_TYPES
    )

Not sure if the docs are meant to be updated or the check requires some preprocessing, since geoms is undefined as currently implemented. This made me think the first argument is intended to be something other than what is documented at present?

@martinfleis
Member

I don't think having one from_dataframe method is a good idea. It would have a ton of conditional keywords and would be messy and hard to understand.

I think that constructors like from_triangulation could easily accept both GeoPandas objects and an array of coords.

or do we take a generic w from input W.from_dataframe(gdf)

and define a conversion to_kernel, to_voronoi, etc?

No. Once you create W you have the adjacency. From there, it shouldn't matter whether that table was created based on contiguity, a kernel, or manually encoded weights. You don't have any information that could be used to do a conversion to voronoi at that point, and you shouldn't have, imho.

I'm not sure I properly understand the problem though.

@ljwolf
Member Author

ljwolf commented Jul 21, 2023

@ljwolf question on the signature of delaunay

Should the first arg be an array or a geoseries/gdf?

The input should be able to be either an (n, 2) array of coordinates, a GeoSeries of POINTs, or a numpy array of shapely POINT geometries. The input coordinates should be validated using:

def _validate_geometry_input(geoms, ids=None, valid_geometry_types=None):
    """
    Ensure that input geometries are always aligned to (and refer back to)
    inputted geometries. Geoms can be a GeoSeries, GeoDataFrame, numpy.array
    with a geometry dtype, or a point array.

and the output of that is used for the computation.

So, everywhere we use _validate_geometry_input() in _triangulation, for example:

    coordinates, ids, geoms = _validate_geometry_input(
        geoms, ids=ids, valid_geom_types=_VALID_GEOMETRY_TYPES
    )

geoms should be coordinates

@ljwolf
Member Author

ljwolf commented Jul 21, 2023

I don't think having one from_dataframe method is a good idea. It would have a ton of conditional keywords and would be messy and hard to understand.

I think that constructors like from_triangulation could easily accept both GeoPandas objects and an array of coords.

or do we take a generic w from input W.from_dataframe(gdf)
and define a conversion to_kernel, to_voronoi, etc?

No. Once you create W you have the adjacency. From there, it shouldn't matter if that table was created based on contiguity, kernel or using manually encoded weights. You don't have any information that could be used to do conversion to voronoi at that point and you should not have imho.

I'm not sure I properly understand the problem though.

My preferred option would be:

  1. the constructor functions (kernel(), vertex_set_intersection(), rook(), delaunay(), gabriel()) take GeoSeries or arrays of coordinates, and return plain W objects. So, calls look like
  • my_w = kernel(my_gdf.geometry, k=3)
  • my_w = delaunay(my_coordinates).
  2. the W class itself uses from_* to read/convert existing objects into a W structure. So, calls look like:
  • W.from_networkx(nx_graph)
  • W.from_file('./columbus.gal')
  • W.from_sparse(my_coo)
  • W.from_adjacency(my_df, focal_col='here', neighbor_col='there', weight_col='strength')
  • W.from_arrays(my_df.here, my_df.there, my_df.strength)

No W.from_dataframe(), so that there's no confusion about what "dataframe" means.

So, instead of

W.from_dataframe(gdf, 'contiguity', queen=True)

We document/recommend

w = weights.contiguity.rook(gdf)
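To make the split concrete, here is a rough sketch of how a constructor function in this style could hang together. Only the rook/from_arrays split mirrors the proposal above; the toy W class and the DE-9IM check are stand-ins, not the code in the geographs branch:

import geopandas as gpd
from shapely.geometry import box

class W:
    """Toy stand-in for the experimental class; only from_arrays is sketched."""

    def __init__(self, adjacency):
        self.adjacency = adjacency  # list of (focal, neighbor, weight) records

    @classmethod
    def from_arrays(cls, focal, neighbor, weight):
        return cls(list(zip(focal, neighbor, weight)))

def rook(geoms):
    """Illustrative constructor function: build arrays, return a plain W."""
    focal, neighbor, weight = [], [], []
    for i, gi in enumerate(geoms.geometry):
        for j, gj in enumerate(geoms.geometry):
            # rook contiguity: boundaries share an edge, interiors do not overlap
            if i != j and gi.relate_pattern(gj, "F***1****"):
                focal.append(i)
                neighbor.append(j)
                weight.append(1.0)
    return W.from_arrays(focal, neighbor, weight)

gdf = gpd.GeoDataFrame(geometry=[box(0, 0, 1, 1), box(1, 0, 2, 1), box(0, 1, 1, 2)])
print(rook(gdf).adjacency)  # the two squares touching only at a corner are not rook neighbors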

@martinfleis
Member

We document/recommend

w = weights.contiguity.rook(gdf)

Mmm, not sure I am a fan of that. The solution with from_contiguity and similar is elegant in the sense that you always create a W with some W.from_* constructor, and you don't have to fish through the documentation and different submodules to find how to create one.

What is the issue with this API design? I'm still not getting the original problem that needs solving.

@ljwolf
Member Author

ljwolf commented Jul 21, 2023

We document/recommend
w = weights.contiguity.rook(gdf)

Mmm, not sure I am a fan of that. The solution with from_contiguity and similar is elegant in the sense that you always create a W with some W.from_* constructor, and you don't have to fish through the documentation and different submodules to find how to create one.

What about pulling them up into the higher-level namespace? weights.rook(gdf)/weights.kernel(gdf)?

What is the issue with this API design? I'm still not getting the original problem that needs solving.

From @ljwolf #534 (comment)

I feel like "from_kernel()" and the like would be confusing. To me, that implies "kernel" is an input data format, not the structure of the output of the method. "to_kernel()" would imply that, but I think then "kernel" should probably be a type?

From @knaaptime #534 (comment)

yeah, completely agree. You don't do from_kernel, because your input data isn't "a kernel".

@knaaptime
Member

knaaptime commented Jul 21, 2023

My preferred option would be:

  1. the constructor functions (kernel(), vertex_set_intersection(), rook(), delaunay(), gabriel()) take GeoSeries or arrays of coordinates, and return plain W objects. So, calls look like
  • my_w = kernel(my_gdf.geometry, k=3)
  • my_w = delaunay(my_coordinates).
  2. the W class itself uses from_* to read/convert existing objects into a W structure. So, calls look like:
  • W.from_networkx(nx_graph)
  • W.from_file('./columbus.gal')
  • W.from_sparse(my_coo)
  • W.from_adjacency(my_df, focal_col='here', neighbor_col='there', weight_col='strength')
  • W.from_arrays(my_df.here, my_df.there, my_df.strength)

No W.from_dataframe(), so that there's no confusion about what "dataframe" means.

So, instead of

W.from_dataframe(gdf, 'contiguity', queen=True)

We document/recommend

w = weights.contiguity.rook(gdf)

I get it. So we have constructor functions rather than classmethod constructors when you need to generate the encoding, and from_* classmethods when you already have the encoding laid out in an existing data structure. That's a change of pace, but it makes sense to me.

What about pulling them up into the higher-level namespace? weights.rook(gdf)/weights.kernel(gdf)?

maybe this. I do think it would make things easier to do from libpysal import weights and be able to tab-complete to see there's a rook function, etc. That makes it a little clearer that rook is definitely a weights builder.

@knaaptime
Member

knaaptime commented Jul 21, 2023

What is the issue with this API design? I'm still not getting the original problem with that that needs solving.

@martinfleis if we still need to "hook kernel (and triangulation) constructors up to the base class", what does that mean? What's the intended signature for instantiating a kernel-based W from a dataframe?
What's the signature for instantiating a kernel-based W from a precomputed distance matrix (since we agree from_kernel doesn't make sense)?

@ljwolf
Member Author

ljwolf commented Jul 21, 2023

kernel weight from a precomputed distance matrix

That should be

weights.kernel(D, metric="precomputed", kernel="gaussian")
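For reference, a call like that would pair naturally with any pre-built distance matrix; the commented import path below is illustrative, not confirmed:

import numpy as np
from scipy.spatial import distance_matrix

coords = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
D = distance_matrix(coords, coords)   # dense pairwise euclidean distances

# hypothetical import path for the experimental constructor:
# from libpysal.weights.experimental import kernel
# w = kernel(D, metric="precomputed", kernel="gaussian")

# roughly what a gaussian kernel does with those precomputed distances
bandwidth = D[D > 0].mean()
print(np.exp(-((D / bandwidth) ** 2) / 2))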

@martinfleis
Member

I originally assumed that it means from_kernel and from_triangulation methods. But if you don't think that it is a good idea let's figure out a better option.

I suggest we try to ensure we have the internal functions that create those adjacency tables finished first, and then we can discuss the public API during a call on Friday, during the next sprint? (Note that I have only my phone with me until Sunday, so I haven't checked the actual state of the code.)

@martinfleis
Member

martinfleis commented Jul 26, 2023

This is what I had in mind.

    @classmethod
    def from_kernel(
        cls,
        data,
        bandwidth=None,
        metric="euclidean",
        kernel="gaussian",
        k=None,
        p=2,
    ):
        """_summary_

        Parameters
        ----------
        data : numpy.ndarray, geopandas.GeoSeries, geopandas.GeoDataFrame
            geometries over which to compute a kernel. If a geopandas object with Point
            geometry is provided, the .geometry attribute is used. If a numpy.ndarray
            with shapely geometry is used, then the coordinates are extracted and used.
            If a numpy.ndarray of shape (n, 2) is used, it is assumed to contain x, y
            coordinates. If metric="precomputed", data is assumed to contain a
            precomputed distance metric.
        bandwidth : float (default: None)
            distance to use in the kernel computation. Should be on the same scale as
            the input coordinates.
        metric : string or callable (default: 'euclidean')
            distance function to apply over the input coordinates. Supported options
            depend on whether or not scikit-learn is installed. If so, then any
            distance function supported by scikit-learn is supported here. Otherwise,
            only euclidean, minkowski, and manhattan/cityblock distances are admitted.
        kernel : string or callable (default: 'gaussian')
            kernel function to apply over the distance matrix computed by `metric`.
            The following kernels are supported:
                - triangular:
                - parabolic:
                - gaussian:
                - bisquare:
                - cosine:
                - boxcar/discrete: all distances less than `bandwidth` are 1, and all
                    other distances are 0
                - identity/None : do nothing, weight similarity based on raw distance
                - callable : a user-defined function that takes the distance vector and
                    the bandwidth and returns the kernel: kernel(distances, bandwidth)
        k : int (default: None)
            number of nearest neighbors used to truncate the kernel. This is assumed
            to be constant across samples. If None, no truncation is conducted.
        ids : numpy.ndarray (default: None)
            ids to use for each sample in coordinates. Generally, construction functions
            that are accessed via W.from_kernel() will set this automatically from
            the index of the input. Do not use this argument directly unless you intend
            to set the indices separately from your input data. Otherwise, use
            data.set_index(ids) to ensure ordering is respected. If None, then the index
            from the input coordinates will be used.
        p : int (default: 2)
            parameter for minkowski metric, ignored if metric != "minkowski".

        Returns
        -------
        W
            libpysal.weights.experimental.W
        """

This in theory also supports DistanceBand and KNN, but it may be worth exposing those as separate constructors for the sake of user friendliness. We can pipe them through this internally, as in the sketch below.
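A minimal sketch of how that piping could look; the argument mapping (distance band as a boxcar kernel truncated at the threshold, KNN as k-truncation) is an assumption about the design, and the stubbed class is not the code in the branch:

class W:
    """Minimal stand-in so the sketch executes; the real class lives in the branch."""

    @classmethod
    def from_kernel(cls, data, bandwidth=None, metric="euclidean",
                    kernel="gaussian", k=None, p=2):
        # placeholder for the constructor sketched in the docstring above
        return cls()

    # hypothetical convenience constructors piped through from_kernel
    @classmethod
    def from_distance_band(cls, data, threshold):
        # a distance band is a boxcar kernel truncated at `threshold`
        return cls.from_kernel(data, bandwidth=threshold, kernel="boxcar")

    @classmethod
    def from_knn(cls, data, k, kernel="boxcar"):
        # KNN truncates the graph to each observation's k nearest neighbours
        return cls.from_kernel(data, k=k, kernel=kernel)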

    @classmethod
    def from_triangulation(
        cls,
        data,
        method="delaunay",
        bandwidth=np.inf,
        kernel="boxcar",
        clip="extent",
        contiguity_type="rook",
    ):
        """_summary_

        Parameters
        ----------
        data : numpy.ndarray, geopandas.GeoSeries, geopandas.GeoDataFrame
            geometries containing locations to compute the
            delaunay triangulation. If a geopandas object with Point
            geometry is provided, the .geometry attribute is used. If a numpy.ndarray
            with shapely geometry is used, then the coordinates are extracted and used.
            If a numpy.ndarray of shape (n, 2) is used, it is assumed to contain x, y
            coordinates.
        method : str, (default "delaunay")
            method of extracting the weights from triangulation. Supports:
                - delaunay
                - gabriel
                - relative_neighborhood
                - voronoi
        bandwidth : _type_, optional
            distance to use in the kernel computation. Should be on the same scale as
            the input coordinates, by default numpy.inf
        kernel : str, optional
            kernel function to use in order to weight the output graph. See
            :meth:`W.from_kernel` for details. By default "boxcar"
        clip : str (default: 'extent')
            An overloaded option about how to clip the voronoi cells passed to
            cg.voronoi_frames() when method="voronoi". Ignored otherwise.
            Default is ``'extent'``. Options are as follows.

            * ``'none'``/``None`` -- No clip is applied. Voronoi cells may be
                arbitrarily larger than the source map. Note that this may lead to
                cells that are many orders of magnitude larger in extent than the
                original map. Not recommended.
            * ``'bbox'``/``'extent'``/``'bounding box'`` -- Clip the voronoi cells to
                the bounding box of the input points.
            * ``'chull'``/``'convex hull'`` -- Clip the voronoi cells to the convex hull
                of the input points.
            * ``'ashape'``/``'ahull'`` -- Clip the voronoi cells to the tightest hull
                that contains all points (e.g. the smallest alphashape, using
                ``libpysal.cg.alpha_shape_auto``).
            * Polygon -- Clip to an arbitrary Polygon.
        contiguity_type : str, optional
            What kind of contiguity to apply to the voronoi diagram when method="voronoi".
            Ignored otherwise. Supports "rook" and "queen", by default "rook".

        Returns
        -------
        W
            libpysal.weights.experimental.W
        """

@ljwolf I think that for voronoi, we don't need to expose rook and queen constructors and can go with the vertex method only because the diagram is always topologically precise.

Was there any discussion on these last time that concluded that this is not an optimal API?

@knaaptime
Member

Was there any discussion on these last time that concluded that this is not an optimal API?

yeah, let's discuss the tradeoffs on Friday. Given the initial layout in the experimental branch, what you sketched above was my original understanding too. But I think we decided the from_* nomenclature is a bit misleading. When you do from_kernel above, you pass the function a geodataframe, not a kernel. The same with triangulation; the first argument is either a geometry object or a list of coordinates, and then an intermediate kernel/triangulation is constructed immediately before giving you a weights object. So it's not really "weights object from_kernel", it's "kernel from_dataframe to weights object".

I can see arguments both ways. It feels natural to do .from_kernel in line with from_contiguity, but the semantics of that aren't actually accurate

@martinfleis
Member

Got it. Let's pick it up on Friday. We may keep the signature and just change the names to something more fitting eventually.

@ljwolf
Member Author

ljwolf commented Jul 28, 2023

So, on the sprint call, we considered the two following APIs:

all constructors "live" on the W

In this example, the core design principle is:

W is self-contained. Most of what you need to do can be done only with the W object, not requiring the weights namespace.

The W.build_...() idiom constructs a W under a known graph construction rule, such as:

W.build_rook(geodataframe.geometry) # calls weights._contiguity.rook(geodataframe.geometry)
W.build_kernel(coordinate_array) # calls weights._kernel.kernel(coordinate_array)

The W.from_...() idiom constructs a W from a specific data structure:

W.from_file("columbus.gal")
W.from_arrays(head, tail, weight)
W.from_sparse(my_coo_array)

To make good on this design decision, we should also implement a .higher_order(), a .lag(), etc. This is a pandas-style API, rather than expecting users to switch between functions in weights and methods on W. This allows for method-chaining, such as:

W.build_kernel(geoseries).transform('r').lag(geoseries.home_value)

all constructors "live" in a non-object namespace

In this example, the core design principle is

W is only a results container. Functions in weights operate on top of W objects and generally construct or mutate them.

We would have to implement a construct namespace, so that:

weights.construct.contiguity(geodataframe.geometry) # calls weights._contiguity.rook(geodataframe.geometry) by default
# or the constructor actually lives at `weights.construct.contiguity`
weights.construct.kernel(coordinate_array) # calls weights._kernel.kernel(coordinate_array)

An ingest (convert?) namespace takes an existing data structure and "ingests" it into a W:

weights.ingest.file("columbus.gal")
weights.ingest.arrays(head, tail, weight)
weights.ingest.sparse(my_coo_array)

and any of our functions in weights act on W objects to build (or mutate) W objects. This is a numpy-style API.

@knaaptime @martinfleis @ljwolf @rongboxu agree on the method strategy, option 1.

@ljwolf
Member Author

ljwolf commented Jul 28, 2023

On the coincident points issue identified in Delaunay, we need to be consistent about how the new implementation deals with coincident points. Naively, Qhull implements a bunch of options, but we need to define a consistent way to deal with coincident points in KNN, too.

@ljwolf
Member Author

ljwolf commented Jul 28, 2023

For handling coincident points, we discussed the following options:

jitter input upon coincidence

randomly permute/jitter the input points if there is a coincident point

construct a clique

reduce the dataset to unique locations, construct the weight, and then introduce a "clique" into the graph connecting all coincident points together and also to other neighbours according to the construction rule.

raise every time that there is a coincident point

detect that there is a coincident point and raise an error if that is the case.

@knaaptime @sjsrey @martinfleis @TaylorOshan @rongboxu @ljwolf agree that the implementation should generally follow the "raise" strategy.

All weights constructors should take a "coincident" argument that can take three values (a rough dispatch sketch follows this list):

  • "raise" Default, raise an error when the coincident points would create an invalid output. This affects all Delaunay/Voronoi-type weights, and KNN when the number of duplicated points is greater than the requested k.
  • "clique" Compute the weights on unique locations in the data. Induce a "clique" into the graph, connecting all coincident points to one another and to the other values for that location.
  • "jitter" Add a very small random displacement (should use a fraction of the smallest pairwise distance) to all coordinates.
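For illustration, a minimal sketch of how a shared coincident= handler might detect duplicates and dispatch on those three values. All names are illustrative; this is not the branch's implementation:

import numpy as np

def _handle_coincident(coordinates, coincident="raise", seed=None):
    """Illustrative pre-processing shared by constructors; not the branch's code."""
    coords = np.asarray(coordinates, dtype=float)
    unique, inverse, counts = np.unique(
        coords, axis=0, return_inverse=True, return_counts=True
    )
    if not (counts > 1).any():
        return coords, None
    if coincident == "raise":
        raise ValueError("Coincident points found; use coincident='clique' or 'jitter'.")
    if coincident == "jitter":
        # displace by a small fraction of the smallest nonzero pairwise distance
        diffs = coords[:, None, :] - coords[None, :, :]
        dists = np.sqrt((diffs ** 2).sum(-1))
        smallest = dists[dists > 0].min()
        rng = np.random.default_rng(seed)
        return coords + rng.uniform(-smallest / 1e4, smallest / 1e4, coords.shape), None
    if coincident == "clique":
        # caller builds the graph on unique locations, then re-expands via `inverse`
        return unique, inverse
    raise ValueError(f"Unknown coincident option: {coincident!r}")

pts = np.array([[0.0, 0.0], [0.0, 0.0], [1.0, 1.0]])
print(_handle_coincident(pts, coincident="clique"))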

@sjsrey
Member

sjsrey commented Jul 28, 2023

@sjsrey @knaaptime @ljwolf @martinfleis agreed to rename W to Graph and move the namespace.

@ljwolf
Member Author

ljwolf commented Aug 11, 2023

I've just realised: for islands, the nonzero property will count them as a nonzero, since self.sparse.nnz counts "explicit" zeros as nonzero. This might throw off calculations in join counts, since a self-join is always going to contribute to positive autocorrelation (in other cases, like Getis-Ord and Moran, we need it)...

We might want to think about using n_edges/n_nodes and then define nonzero (number of non-self loops) as self.n_edges - len(self.isolates)
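For reference, a tiny demonstration of the behaviour being described: explicit zeros are stored entries, so they count toward nnz (toy data, with the island stored as a zero-weight self-loop):

import numpy as np
from scipy import sparse

# 3-node graph; node 2 is an island stored as an explicit self-loop with weight 0
heads = np.array([0, 1, 2])
tails = np.array([1, 0, 2])
vals = np.array([1.0, 1.0, 0.0])
adj = sparse.coo_array((vals, (heads, tails)), shape=(3, 3)).tocsr()

print(adj.nnz)      # 3 -- the explicit zero is counted as "nonzero"
print(adj.nnz - 1)  # 2 -- what n_edges - len(isolates) would give here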

@martinfleis
Member

n_edges is just len(self._adjacency) no?

@ljwolf
Member Author

ljwolf commented Aug 11, 2023

yep

@knaaptime
Member

id ordering not respected when roundtripping through sparse

https://gist.github.com/knaaptime/d17b2d233c663aeec3e0e95487910560

@martinfleis
Member

@knaaptime regarding this roundtripping - the distance band with isolates or without isolates works fine. The Graphs are equal, the sorting within a single focal does not matter. Sparse is also equal based on both (original and roundtripped).

The issue with KNN has nothing to do with sorting or roundtripping but with duplicated point locations. KNN builder drops 4 observations. That will be partially solved by the application of #548 to KNN. I will also add length checks to raise when the ids != length of the sparse.

@martinfleis
Member

martinfleis commented Aug 28, 2023

More generally on sorting - I believe that the only sorting that matters is the position of the first observation of each focal. We only need to ensure that the arrays are sorted in a way that ensures that focal.unique() is equal to the index of original observations from which the Graph is being built. That ensures that any conversion depending on the order is correct since it will either depend on _id2i based on .unique() or on factorize, resulting in the same.

Edit: This works under an assumption that every observation is present in focal, either with a neighbor or as a self-loop with 0.
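A small illustration of that invariant with toy data (not the Graph internals): as long as the first appearances of focal follow the input order, the sparse representation comes out in the input order regardless of how the id-to-position mapping is built.

import pandas as pd
from scipy import sparse

original_index = pd.Index(["a", "b", "c"])          # order of the input observations
adjacency = pd.DataFrame({
    "focal":    ["a", "a", "b", "c"],
    "neighbor": ["b", "c", "a", "a"],
    "weight":   [1.0, 1.0, 1.0, 1.0],
})

# the invariant: first appearances of focal follow the input order
assert (adjacency["focal"].unique() == original_index).all()

# any positional mapping built from that order (an _id2i dict, factorize, ...)
# then reproduces the input order in sparse
id2i = {k: i for i, k in enumerate(original_index)}
rows = adjacency["focal"].map(id2i).to_numpy()
cols = adjacency["neighbor"].map(id2i).to_numpy()
s = sparse.coo_array((adjacency["weight"].to_numpy(), (rows, cols)), shape=(3, 3))
print(s.toarray())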

@knaaptime
Member

More generally on sorting - I believe that the only sorting that matters is the position of the first observation of each focal. We only need to ensure that the arrays are sorted in a way that ensures that focal.unique() is equal to the index of original observations from which the Graph is being built.

exactly. We always need to respect the index of the original input, which means sometimes we are forced to sort. That's where I ended up in #513 too, so I just used a reindex on those ids at both levels, which is the same as keeping id2i around. Groupby ops will break that ordering, so we need to be able to reindex by the focal ordering arbitrarily (which is why I let sort_neighbors stick around). That way, 'the position of the first observation of each focal' is always respected.

I was going directly from adj to sparse, so I had to keep track of the column indices as well, otherwise you hit the pivot issue.

@ljwolf
Member Author

ljwolf commented Aug 28, 2023

I think that, insofar as we have used the term, "sorting" refers to re-ordering the weights into an arbitrary order with respect to the input data. Our goal of "no sorting" should be more clearly defined as: output sparse matrices should always conform to the order of the input data. I don't think column ordering/tail ordering can be guaranteed unless ids are provided?

@knaaptime
Member

Our goal of "no sorting" should be more clearly defined as: output sparse matrices should always conform to the order of the input data.

Isn't this what I'm saying?

I don't think column ordering/tail ordering can be guaranteed unless ids are provided?

I don't think I'm following. If no ids are provided, then the ordering of the columns is the range index of the rows? (If no ids are provided, then the index is the id?)

If we always know that both rows and columns of sparse are ordered in the canonical sorting given by the rows of the input, isn't that ideal?

@knaaptime
Member

if we always sort both levels of the adjlist by the order given in the input data, then it's in the same conceptual structure as a sparse matrix with both rows and cols arranged i through j. That way, when you pivot from table to sparse, everything is in the right place without needing to specify row/col labels.

@martinfleis
Member

Our goal of "no sorting" should be more clearly defined as: output sparse matrices should always conform to the order of the input data.

If we create _id2i on __init__, then the adjacency can be ordered as randomly as it likes. sparse does not depend directly on the order of adjacency but on the _id2i mapping. Apart from unique_ids, which is used to create _id2i, the only other thing that currently assumes some sorting is to_W, which can be refactored. Then, with stuff like cliques, we take the original index, create _id2i, and then simply dump cliques at the end of the adjacency; no sorting needed.

This means that we either expose this on a subset of constructors (like from_arrays) or always derive it from the given data, imposing some assumptions on how the input data should look.

@knaaptime
Member

right, we can always create id2i from the input data, so the adjlist should always respect that order. In the current structure, the adjlist is sometimes random, but sparse is always ordered according to id2i. I think that's inconsistent and it's better to just say id2i is the sort order. And I think it's much easier to rely on pandas indices for handling indexing rather than rolling it from scratch.

We're not imposing any assumptions on the input data in my implementation; we're just saying the adjlist order should match the sparse order.

@knaaptime
Member

knaaptime commented Aug 28, 2023

sparse does not depend directly on the order of adjacency but on the _id2i mapping.

so, since id2i is static because it needs to match the input order, it means sparse is static, so why is adjlist arbitrary? That just creates one more index we need to map between. Let's just use a single index, since it's the only one that matters (which was the original idea of using a sparse multiindex series).

@martinfleis
Member

So you would enforce sorting of the adjacency and drop id2i, since we can infer it from the order?

@knaaptime
Member

sparse does not depend directly on the order of adjacency but on the _id2i mapping.

yeah, I think this is my point

So you would enforce sorting of the adjacency and drop id2i, since we can infer it from the order?

I think so, but maybe we stash it for convenience. I think id2i forces us to consider that there is an inherent order to the data.

The whole idea of using a data frame as the central structure is to avoid doing the mapping and indexing ourselves. Why in the world would we let the adjlist be random when we're using the indices to provide structure? If id2i is structured, and sparse is structured, but the adjlist is random (at the neighbor level), then how is that the 'canonical' representation of the graph? Let's just get it into the correct shape once, and let adjacency be the actual central construct.

The only other thing apart from unique_ids that is used to create _id2i that currently assumes some sorting is to_W which can be refactored.

and I think maybe this is the crux of it. We're still struggling with whether the graph is ordered or not. Since the ordering of the rows matters (it needs to match the input order for things like lag), the graph is ordered at one level. If we are forced to recognize that order, then let's actually recognize it. We already have a formally-ordered output in sparse, and that order matters, so it's not true that to_W is the only thing that depends on ordering...

Since formal order matters in sparse, why allow an arbitrarily-ordered one in adjacency? I think our cognitive dissonance is showing if we say order matters for one representation, but the 'canonical data structure' at the center of the graph is arbitrary.

@martinfleis
Member

I think that this boils down to what "no sorting" means. It can mean "no sorting is expected from adjacency" or "no sorting happens during internal operations". The former requires id2i and sorting when casting to sparse, the latter requires sorting on initialisation. Neither is fully "no sorting" but the latter may be preferable.

@martinfleis
Member

Also, sorting on init cannot be just sort_values(['focal', 'neighbor']); it has to sort using the original order of the index of geometries.

@knaaptime
Member

I think that this boils down to what "no sorting" means. It can mean "no sorting is expected from adjacency" or "no sorting happens during internal operations". The former requires id2i and sorting when casting to sparse, the latter requires sorting on initialisation. Neither is fully "no sorting" but the latter may be preferable.

yes, well said. I think we're sorting no matter what, so 'explicit is better than implicit' and I'd prefer to do it on init (and say we don't sort after).

Also, sorting on init can not be just sort_values(['focal', 'neighbor']) but sort using the original order of the index of geometries.

so on init, I would do something like the sketch below: set the index as ['focal', 'neighbor'], then reindex both levels by the geometry index and reset the index.

...and luckily, this is what raised the issue 😁. So instead of having this happen only in from_adjacency, I'd have it in the init.
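A rough pandas sketch of that init-time reindex (toy data; only the focal level is shown to keep it short, and the column names are assumptions):

import pandas as pd

geom_order = ["c", "a", "b"]                    # original order of the input geometries

# an adjacency table that arrived in arbitrary order
adjacency = pd.DataFrame({
    "focal":    ["a", "b", "a", "c"],
    "neighbor": ["b", "a", "c", "a"],
    "weight":   [1.0, 1.0, 1.0, 1.0],
})

# set the multiindex, reindex the focal level to follow the input order, reset
ordered = (
    adjacency.set_index(["focal", "neighbor"])
    .reindex(geom_order, level="focal")
    .reset_index()
)
print(ordered)  # focal now appears in the order c, a, b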

@martinfleis
Member

One thing this doesn't solve is the existence of id2i, because for any index other than RangeIndex(0, n), we still need a mapping from ids to integers, even on the fly, when going to sparse. That trick with pandas sparse you used before is applicable only to sparse.matrix, not sparse.array, so I'm not sure we want to depend on that for now.

@knaaptime
Member

knaaptime commented Aug 28, 2023

ah snap, bummer. But I think that's ok... let's keep id2i as a private attribute and use that as our 'reindexer'?

so, without changing much code: id2i is the order of the graph, so all representations can be expected to follow that structure. Since I was using a multiindexed series instead of a dataframe, the first level of the index defines the order of the graph.

@martinfleis
Member

Probably. It is also used as the inverse i2id in stuff that is computed on sparse, like asymmetry. We could do these on the fly based on the sorted index, but then a private cached property, as we have now, is probably a better solution.

@knaaptime
Member

knaaptime commented Aug 28, 2023

the way I spec'd it, the adjlist is the source of truth. So a private attribute stores the sort order from the first level of the multiindex, and that defines the canonical order in all of [adjlist, neighbors, weights, sparse]. I think that's preferable because the canonical order is always consistently defined across representations (based on the core data structure). In that case, we're saying a property of the Graph is that it's always ordered with respect to the input data, and that order is defined by the adjacency table.

Since that's already a requirement of some representations, it's better to just standardize the convention and adopt it throughout

@martinfleis
Member

Sounds good to me. Feel free to go ahead and give it a go. I won't be able to work on it much this week.

@knaaptime
Member

a few more updates in the gist

tldr: do we want to allow the builders to accept a sparse matrix as input when metric='precomputed'? It seems that right now they only accept dense. I wasn't sure if that's intended or not. If we accept sparse, then we also need to accept ids?

That trick with pandas sparse you used before is applicable only for sparse.matrix, not sparse.array, so I'm not sure we want to depend on that for now.

if we need an array, can't we do scipy.sparse.csr_array(sparse_matrix)?
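For reference, that conversion does work, since the sparse-array constructors accept spmatrix input:

import numpy as np
from scipy import sparse

m = sparse.csr_matrix(np.eye(3))  # old spmatrix interface
a = sparse.csr_array(m)           # new sparse-array interface, same data
print(type(a).__name__, a.nnz)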

@jGaboardi
Member

@knaaptime that 3-4 second timing is pretty wild...

@ljwolf
Member Author

ljwolf commented Aug 31, 2023

The _kernel() constructor should take a sparse matrix correctly, and I'd like to make sure that propagates upwards to the build_kernel() method. I had also implemented ways to get distance band and knn off of sparse distance matrices, but I don't know where that ended up in the re-shuffle of _kernel() yet. My goal in implementing _kernel() initially was for everything in _distance.py to accept sparse metric='precomputed' input. The _contiguity.py and _triangulation.py constructors won't need to (indeed, I don't think Gabriel/delaunay triangulations are defined over anything but Minkowski metrics, and I don't think scipy supports non-euclidean Delaunay?).

@knaaptime
Member

ok, I figured that was the goal. I think both _knn and _distanceband are also set up to take sparse without much effort; they just need the same check as _kernel that bypasses the geometry validator when sparse is the input.

one thing I noticed just now is that the default kernel for the Graph is gaussian, but for W it was triangular, so we might want to surface that.

@martinfleis
Member

@ljwolf Any reason to not default to the same kernel?

@ljwolf
Member Author

ljwolf commented Sep 1, 2023

Nope! Minor mathematical reasons, scikit generally defaults to Gaussian kernels, but I'm definitely not wedded to it!
