Weights sprint planning #534
Things we have right now:
What needs to be done:
Where we're blocked by scipy.sparse development
My suggestion would be to focus on testing right now. The codebase is getting large, and it should be thoroughly covered before we start moving forward. |
I also think that we should start making PRs into the geography branch rather than committing directly, to ensure proper review. |
I will try to finish scipy/scipy#18544, which should unblock higher order. We will need to also identify why connected components fails in the csgraph side, but that should be doable tomorrow. If necessary, we can build on top of the matrix classes for now and migrate once the arrays are feature complete? |
csgraph does seem to generally require matrix for now.
Maybe as on-the-fly conversion when necessary? |
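If csgraph keeps requiring the spmatrix interface, the on-the-fly conversion could be as small as wrapping at the call site. A minimal sketch (the toy graph here is illustrative, not libpysal code): build with the newer sparse-array interface and convert to an spmatrix only where csgraph needs it.

```python
import numpy as np
from scipy import sparse
from scipy.sparse import csgraph

# build a small adjacency as a sparse *array* (the newer interface)
adj = sparse.csr_array(np.array([
    [0, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 0],
]))

# convert to the spmatrix interface on the fly where csgraph expects it
adj_m = sparse.csr_matrix(adj)
n_components, labels = csgraph.connected_components(adj_m, directed=False)
print(n_components)  # two disconnected pairs -> 2
```

Once the arrays are feature complete on the csgraph side, the wrapper can simply be deleted.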
We are currently not sure if there is an "adaptive" bandwidth parameter allowed in the |
Not going to be able to complete scipy/scipy#18544 today, but will aim to next sprint. |
TLDR:
I'm trying to wrap my head around the idea of input data structures versus desired output as first-class citizens for constructing $W$s. After our call, I think the idea is to always treat the input data structure as primary. Right now, we have a
Instead, the idea is that
so if we want to create a kernel-based W, and we are treating input data structure as first class, then I need to do something like
The alternative would be to leave desired output as first class, which would mean leaving |
I did not recall we had implemented some of them already. That's my misunderstanding; I was mistaken. Sorry! If from_contiguity is already there, run with that and add them for now. From a design angle, I feel like "from_kernel()" and the like would be confusing. To me, that implies "kernel" is an input data format, not the structure of the output of the method. "to_kernel()" would imply that, but I think then "kernel" should probably be a type? I hoped to avoid typing like this in favor of the construction functions outputting plain W objects, since we get into tricky inheritance consistency issues (like, is a kernel-weighted voronoi an instance of rook, kernel, Delaunay, or all?). But, we may need to chat about this next sprint? |
yeah, completely agree. You don't do
so the issue is we want a generic W class, and it should be instantiated explicitly from different data structures. But there are also genuinely different subtypes of W, as they mean different things and require different computations to create them. Before, we had subclasses for each type of W, and they inherited the same constructors for different input formats. I agree we want to avoid weird inheritance, and subclassing W more generally, but then the only way I can think to handle the combination of w_type * data_structure is to slightly overload the arguments for data-structure-based class methods, like
that gets messy because it needs to accept all the kwargs for constructing all subtypes of W from that data structure, but it's conceptually straightforward (and the converse of our previous design, where w-subtype was first-class, and it inherited constructors for the shared input data types) |
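A toy sketch of the overloaded, data-structure-first constructor described above. Everything here is hypothetical, not libpysal's API: the `_contiguity`/`_kernel` helpers are stand-in stubs, and `w_type` is an invented dispatch argument.

```python
import pandas as pd

def _contiguity(df, rule):
    # stub: pretend each consecutive pair of rows is adjacent
    n = len(df)
    return pd.DataFrame(
        {"focal": range(n - 1), "neighbor": range(1, n), "weight": 1.0}
    )

def _kernel(df, bandwidth=1.0):
    # stub: dense placeholder standing in for a real kernel computation
    n = len(df)
    return pd.DataFrame(
        {"focal": [i for i in range(n) for _ in range(n)],
         "neighbor": list(range(n)) * n,
         "weight": 1.0 / bandwidth}
    )

class W:
    def __init__(self, adjacency):
        self.adjacency = adjacency  # tidy focal/neighbor/weight table

    @classmethod
    def from_dataframe(cls, df, w_type="rook", **kwargs):
        # dispatch on the *output* type while keeping the input data
        # structure (a dataframe) first-class; kwargs are the union of
        # everything each w_type needs, which is the messy part
        if w_type in ("rook", "queen"):
            return cls(_contiguity(df, w_type, **kwargs))
        if w_type == "kernel":
            return cls(_kernel(df, **kwargs))
        raise ValueError(f"unknown w_type: {w_type}")

df = pd.DataFrame({"geometry": [0, 1, 2]})
w = W.from_dataframe(df, w_type="kernel", bandwidth=2.0)
```

The kwargs-union problem is visible even in this toy: `bandwidth` only makes sense for one branch of the dispatch.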
or do we take a generic w from input W.from_dataframe(gdf) and define a conversion to_kernel, to_voronoi, etc?
Maybe “from_dataframe” should be the constructor for an arbitrary adjacency table?
|
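The converse design floated here, a generic W built from any adjacency table plus `to_*` converters, might look roughly like this toy sketch. The class and method bodies are invented for illustration; the kernel conversion here just re-weights the existing edges, treating the stored weights as distances.

```python
import pandas as pd

class W:
    """Toy sketch of a generic W built from an adjacency table (hypothetical API)."""

    def __init__(self, adjacency):
        self.adjacency = adjacency  # tidy focal/neighbor/weight table

    @classmethod
    def from_dataframe(cls, adj_df):
        # accept any tidy adjacency table, regardless of how it was built
        return cls(adj_df[["focal", "neighbor", "weight"]].copy())

    def to_kernel(self, bandwidth=1.0):
        # re-weight existing edges with a boxcar-style kernel, treating
        # the current weights as distances
        out = self.adjacency.copy()
        out["weight"] = (out["weight"] <= bandwidth).astype(float)
        return W(out)

adj = pd.DataFrame(
    {"focal": [0, 0, 1], "neighbor": [1, 2, 2], "weight": [0.5, 2.0, 0.9]}
)
w = W.from_dataframe(adj).to_kernel(bandwidth=1.0)
```

Note the limitation: a converter can only transform edges that already exist, so conversions that need the original geometry (voronoi, for instance) do not fit this pattern.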
@ljwolf question on the signature of Should the first arg be an array or a geoseries/gdf? libpysal/libpysal/weights/experimental/_triangulation.py Lines 25 to 36 in 88fb4b6
as later on libpysal/libpysal/weights/experimental/_triangulation.py Lines 77 to 79 in 88fb4b6
Not sure if the docs were to be updated or the check requires some preprocessing as |
I don't think having one I think that constructors like
No. Once you create W you have the adjacency. From there, it shouldn't matter if that table was created based on contiguity, kernel or using manually encoded weights. You don't have any information that could be used to do conversion to voronoi at that point and you should not have imho. I'm not sure I properly understand the problem though. |
The input should be able to be either an libpysal/libpysal/weights/experimental/_utils.py Lines 25 to 29 in 88fb4b6
and the output of that is used for the computation. So, everywhere where we use libpysal/libpysal/weights/experimental/_triangulation.py Lines 77 to 79 in 88fb4b6
|
My preferred option would be:
No So, instead of
We document/recommend
|
Mmm, not sure I am a fan of that. The solution with What is the issue with this API design? I'm still not getting the original problem that needs solving. |
What about pulling them up into the higher-level namespace?
From @ljwolf #534 (comment)
From @knaaptime #534 (comment)
|
i get it. So we have constructor functions rather than classmethod constructors when you need to generate the encoding, and
maybe this. I do think it would make things easier to do |
@martinfleis if we still need to "hook kernel (and triangulation) constructors up to the base class", what does that mean? what's the intended signature for instantiating a kernel-based W from a dataframe? |
That should be weights.kernel(D, metric="precomputed", kernel="gaussian") |
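As a rough illustration of what metric="precomputed" implies: the builder skips distance computation and applies the kernel function directly to the user-supplied matrix. A minimal sketch of a gaussian kernel over a precomputed distance matrix (not libpysal's implementation):

```python
import numpy as np

def gaussian_kernel(distances, bandwidth):
    # standard gaussian kernel applied elementwise to a distance matrix
    u = distances / bandwidth
    return np.exp(-0.5 * u**2)

# precomputed pairwise distances for three points on a line at 0, 1, 3
D = np.array([
    [0.0, 1.0, 3.0],
    [1.0, 0.0, 2.0],
    [3.0, 2.0, 0.0],
])
K = gaussian_kernel(D, bandwidth=1.0)
```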
I originally assumed that it means
I suggest we try to ensure we have the internal functions that create those adjacency tables finished first and then we can discuss the public API during a call on Friday during the next sprint? (Note that I have only my phone with me until Sunday so I haven't checked the actual state of the code) |
This is what I had in mind.

@classmethod
def from_kernel(
cls,
data,
bandwidth=None,
metric="euclidean",
kernel="gaussian",
k=None,
p=2,
):
"""_summary_
Parameters
----------
data : numpy.ndarray, geopandas.GeoSeries, geopandas.GeoDataFrame
geometries over which to compute a kernel. If a geopandas object with Point
geometry is provided, the .geometry attribute is used. If a numpy.ndarray
with shapely geometry is used, then the coordinates are extracted and used.
If a numpy.ndarray of a shape (2,n) is used, it is assumed to contain x, y
coordinates. If metric="precomputed", data is assumed to contain a
precomputed distance metric.
bandwidth : float (default: None)
distance to use in the kernel computation. Should be on the same scale as
the input coordinates.
metric : string or callable (default: 'euclidean')
distance function to apply over the input coordinates. Supported options
depend on whether or not scikit-learn is installed. If so, then any
distance function supported by scikit-learn is supported here. Otherwise,
only euclidean, minkowski, and manhattan/cityblock distances are admitted.
kernel : string or callable (default: 'gaussian')
kernel function to apply over the distance matrix computed by `metric`.
The following kernels are supported:
- triangular:
- parabolic:
- gaussian:
- bisquare:
- cosine:
- boxcar/discrete: all distances less than `bandwidth` are 1, and all
other distances are 0
- identity/None : do nothing, weight similarity based on raw distance
- callable : a user-defined function that takes the distance vector and
the bandwidth and returns the kernel: kernel(distances, bandwidth)
k : int (default: None)
number of nearest neighbors used to truncate the kernel. This is assumed
to be constant across samples. If None, no truncation is conducted.
ids : numpy.ndarray (default: None)
ids to use for each sample in coordinates. Generally, construction functions
that are accessed via W.from_kernel() will set this automatically from
the index of the input. Do not use this argument directly unless you intend
to set the indices separately from your input data. Otherwise, use
data.set_index(ids) to ensure ordering is respected. If None, then the index
from the input coordinates will be used.
p : int (default: 2)
parameter for minkowski metric, ignored if metric != "minkowski".
Returns
-------
W
libpysal.weights.experimental.W
""" This in theory supports also DistanceBand and KNN but it may be worth exposing those as separate constructors for the sake of user friendliness. We can pipe those through this internally. @classmethod
def from_triangulation(
cls,
data,
method="delaunay",
bandwidth=np.inf,
kernel="boxcar",
clip="extent",
contiguity_type="rook",
):
"""_summary_
Parameters
----------
data : numpy.ndarray, geopandas.GeoSeries, geopandas.GeoDataFrame
geometries containing locations to compute the
delaunay triangulation. If a geopandas object with Point
geometry is provided, the .geometry attribute is used. If a numpy.ndarray
with shapely geometry is used, then the coordinates are extracted and used.
If a numpy.ndarray of a shape (2,n) is used, it is assumed to contain x, y
coordinates.
method : str, (default "delaunay")
method of extracting the weights from triangulation. Supports:
- delaunay
- gabriel
- relative_neighborhood
- voronoi
bandwidth : float, optional
distance to use in the kernel computation. Should be on the same scale as
the input coordinates, by default numpy.inf
kernel : str, optional
kernel function to use in order to weight the output graph. See
:meth:`W.from_kernel` for details. By default "boxcar"
clip : str (default: 'extent')
An overloaded option about how to clip the voronoi cells passed to
cg.voronoi_frames() when method="voronoi". Ignored otherwise.
Default is ``'extent'``. Options are as follows.
* ``'none'``/``None`` -- No clip is applied. Voronoi cells may be
arbitrarily larger that the source map. Note that this may lead to
cells that are many orders of magnitude larger in extent than the
original map. Not recommended.
* ``'bbox'``/``'extent'``/``'bounding box'`` -- Clip the voronoi cells to
the bounding box of the input points.
* ``'chull'``/``'convex hull'`` -- Clip the voronoi cells to the convex hull
of the input points.
* ``'ashape'``/``'ahull'`` -- Clip the voronoi cells to the tightest hull
that contains all points (e.g. the smallest alphashape, using
``libpysal.cg.alpha_shape_auto``).
* Polygon -- Clip to an arbitrary Polygon.
contiguity_type : str, optional
What kind of contiguity to apply to the voronoi diagram when method="voronoi".
Ignored otherwise. Supports "rook" and "queen", by default "rook".
Returns
-------
W
libpysal.weights.experimental.W
""" @ljwolf I think that for voronoi, we don't need to expose Was there any discussion on these last time that concluded that this is not an optimal API? |
yeah, let's discuss the tradeoffs on Friday. Given the initial layout in the
I can see arguments both ways. It feels natural to do |
Got it. Let's pick it up on Friday. We may keep the signature and just change the names to something more fitting eventually. |
So, on the sprint call, we considered the following two APIs: all constructors "live" on the
|
On the coincident points issue identified in Delaunay: we need to be consistent about how the new implementation deals with coincident points. Naively, Qhull implements a bunch of options, but we need to define a consistent way to deal with coincident points in KNN, too. |
For handling coincident points, we discussed the following options:
- jitter input upon coincidence: randomly permute/jitter the input points if there is a coincident point
- construct a clique: reduce the dataset to unique locations, construct the weight, and then introduce a "clique" into the graph connecting all coincident points together and also to other neighbours according to the construction rule
- raise every time that there is a coincident point: detect that there is a coincident point and raise an error if that is the case

@knaaptime @sjsrey @martinfleis @TaylorOshan @rongboxu @ljwolf agree that the implementation should generally follow the "raise" strategy. All weights constructors should take a "coincident" argument that can take three values:
|
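A minimal sketch of what the "raise" strategy (with a jitter fallback) could look like. The helper name, the `coincident` argument handling, and the jitter scale are all hypothetical, not the agreed implementation.

```python
import numpy as np
import pandas as pd

def check_coincident(coords, coincident="raise"):
    # hypothetical helper: 'coincident' mirrors the argument discussed above
    dupes = pd.DataFrame(coords).duplicated()
    if not dupes.any():
        return coords
    if coincident == "raise":
        raise ValueError(f"{dupes.sum()} coincident point(s) found.")
    if coincident == "jitter":
        # nudge only the duplicated rows by a tiny random offset
        rng = np.random.default_rng(0)
        jittered = coords + rng.normal(scale=1e-9, size=coords.shape)
        return np.where(dupes.to_numpy()[:, None], jittered, coords)
    raise NotImplementedError(f"coincident={coincident!r}")

pts = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 0.0]])
```

The clique option would need graph surgery after construction rather than a pre-processing step, so it is not sketched here.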
I've just realised: for islands, the We might want to think about using |
|
yep |
id ordering not respected when roundtripping through sparse https://gist.github.com/knaaptime/d17b2d233c663aeec3e0e95487910560 |
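The underlying hazard with roundtripping is that scipy sparse structures are purely positional: labels vanish on the way in, and format conversions are free to reorder entries, so the id-to-position mapping must be fixed from the input and carried alongside. A small illustration (not libpysal code):

```python
import pandas as pd
from scipy import sparse

# adjacency labelled by arbitrary string ids
adj = pd.DataFrame(
    {"focal": ["b", "a"], "neighbor": ["a", "b"], "weight": [1.0, 1.0]}
)

# fix the id -> position mapping once, from the canonical input order
ids = pd.Index(["a", "b"])
i = ids.get_indexer(adj["focal"])
j = ids.get_indexer(adj["neighbor"])
mat = sparse.coo_matrix((adj["weight"], (i, j)), shape=(2, 2))

# format conversions reorder the positional entries...
reordered = mat.tocsr().tocoo()

# ...but mapping positions back through the same ids stays correct
back = pd.DataFrame(
    {"focal": ids[reordered.row], "neighbor": ids[reordered.col],
     "weight": reordered.data}
)
```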
@knaaptime regarding this roundtripping - distance band works fine with or without isolates. The Graphs are equal; the sorting within a single focal does not matter. Sparse is also equal based on both (original and roundtripped). The issue with KNN has nothing to do with sorting or roundtripping but with duplicated point locations. The KNN builder drops 4 observations. That will be partially solved by the application of #548 to KNN. I will also add length checks to raise when the number of ids != length of the sparse. |
More generally on sorting - I believe that the only sorting that matters is the position of the first observation of each focal. We only need to ensure that the arrays are sorted in a way that ensures that
Edit: This works under an assumption that every observation is present in |
exactly. We always need to respect the index of the original input, which means sometimes we are forced to sort. That's where I ended up in #513 too, so I just used a reindex on those ids at both levels, which is the same as keeping id2i around. Groupby ops will break that ordering, so we need to be able to reindex by the focal ordering arbitrarily (which is why I let sort_neighbors stick around). That way 'the position of the first observation of each focal' is always respected. I was going directly from adj to sparse, so I had to keep track of the column indices as well, otherwise you hit the pivot issue |
I think that, insofar as we have used the term, "sorting" refers to re-ordering the weights into an arbitrary order with respect to the input data. Our goal of "no sorting" should be more clearly defined as: output sparse matrices should always conform to the order of the input data. I don't think column ordering/tail ordering can be guaranteed unless ids are provided? |
isn't this what I'm saying?
I don't think I'm following. If no ids are provided, then the ordering of the columns is the range index of the rows (if no ids are provided, then the index is the id). If we always know that both rows and columns of sparse are ordered in the canonical sorting given by the rows of the input, isn't that ideal? |
if we always sort both levels of the adjlist by the order given in the input data, then it's in the same conceptual structure as a sparse matrix with both rows and cols arranged i through j. That way, when you pivot from table to sparse, everything is in the right place without needing to specify row/col labels |
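One concrete way to realize this, assuming string ids whose canonical order comes from the input data, is to make both levels categorical in that order, so that the positions in sparse fall straight out of the category codes (illustrative, not libpysal code):

```python
import pandas as pd
from scipy import sparse

ids = ["c", "a", "b"]  # canonical order: the order of the input data
adj = pd.DataFrame(
    {"focal": ["a", "c", "b"], "neighbor": ["c", "a", "c"], "weight": 1.0}
)

# sort both levels by the input order; category codes then *are* the
# row/col positions, so no explicit row/col labels are needed to pivot
order = pd.CategoricalDtype(ids, ordered=True)
adj = adj.astype({"focal": order, "neighbor": order}).sort_values(
    ["focal", "neighbor"]
)
mat = sparse.coo_matrix(
    (adj["weight"], (adj["focal"].cat.codes, adj["neighbor"].cat.codes)),
    shape=(3, 3),
)
```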
If we create This means that we either expose this on a subset of constructors (like |
right, we can always create id2i from the input data, so the adjlist should always respect that order. In the current structure, the adjlist is sometimes random, but sparse is always ordered according to id2i. I think that's inconsistent and it's better to just say id2i is the sort order. And I think it's much easier to rely on pandas indices for handling indexing rather than rolling it from scratch. We're not imposing any assumptions on the input data in my implementation; we're just saying the adjlist order should match the sparse order |
so, since id2i is static because it needs to match the input order, it means sparse is static, so why is adjlist arbitrary? That just creates one more index we need to map between. Let's just use a single index, since it's the only one that matters (which was the original idea of using a sparse multiindex series) |
So you would enforce sorting of the adjacency and drop |
yeah, I think this is my point
I think so, but maybe we stash it for convenience. I think id2i forces us to consider that there is an inherent order to the data. The whole idea of using a data frame as the central structure is to avoid doing the mapping and indexing ourselves. Why in the world would we let the adjlist be random when we're using the indices to provide structure? If id2i is structured, and sparse is structured, but adjlist is random (at the neighbor level), then how is that the 'canonical' representation of the graph? Let's just get it into the correct shape once, and let
and I think maybe this is the crux of it. We're still struggling with whether the graph is ordered or not. Since the ordering of the rows matters (needs to match input order for things like
Since formal order matters in sparse, why allow an arbitrarily-ordered one in |
I think that this boils down to what "no sorting" means. It can mean "no sorting is expected from adjacency" or "no sorting happens during internal operations". The former requires id2i and sorting when casting to sparse, the latter requires sorting on initialisation. Neither is fully "no sorting" but the latter may be preferable. |
Also, sorting on init cannot be just |
yes, well said. I think we're sorting no matter what, so 'explicit is better than implicit' and I'd prefer to do it on init (and say we don't sort after)
so on init, I would do something like this: set the index as ['focal','neighbor'], then reindex both levels by the geometry index and reset the index. ...and luckily, this is what raised the issue 😁. So instead of having this happen only in
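A concrete sketch of sorting on init, equivalent in spirit to the set_index/reindex chain described above but written with sort_values and a label-to-position key (names and data are illustrative):

```python
import pandas as pd

geom_index = pd.Index(["c", "a", "b"])  # order of the input geometries
adj = pd.DataFrame(
    {"focal": ["a", "b", "c"], "neighbor": ["b", "a", "a"], "weight": 1.0}
)

# sort once on init so both levels follow the input order; afterwards,
# adjacency, neighbors, weights, and sparse all share one canonical order
positions = {k: i for i, k in enumerate(geom_index)}
sorted_adj = adj.sort_values(
    ["focal", "neighbor"], key=lambda s: s.map(positions)
).reset_index(drop=True)
```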
One thing this doesn't solve is the existence of |
ah snap, bummer. But I think that's ok... let's keep id2i as a private attribute and use that as our 'reindexer'. so, without changing much code:
Probably It is also used as inverse i2id in stuff that is computed on sparse, like asymmetry. We could do them on the fly based on the sorted index but then a private cached property as we have now is probably a better solution. |
the way I spec'd it, the adjlist is the source of truth. So a private attribute stores the sort order from the first level of the multiindex, and that defines the canonical order in all of [adjlist, neighbors, weights, sparse]. I think that's preferable because the canonical order is always consistently defined across representations (based on the core data structure). In that case, we're saying a property of the Graph is that it's always ordered with respect to the input data, and that order is defined by the adjacency table. Since that's already a requirement of some representations, it's better to just standardize the convention and adopt it throughout |
Sounds good to me. Feel free to go ahead and give it a go. I won't be able to work on it much this week. |
a few more updates in the gist. tl;dr: do we want to allow the builders to accept a sparse matrix as input when
if we need an array, can't we do |
@knaaptime that 3-4 second timing is pretty wild... |
The |
ok, I figured that was the goal. I think both _knn and _distanceband are also set up to take sparse without much effort; they just need the same check as _kernel that bypasses the geom validator when sparse is input. One thing I noticed just now is that the default kernel for the graph is gaussian but for W it was triangular, so we might want to surface that |
@ljwolf Any reason to not default to the same kernel? |
Nope! Minor mathematical reasons, scikit generally defaults to Gaussian kernels, but I'm definitely not wedded to it! |
I'm starting this thread to begin discussions about the upcoming weights sprint. The current experimental implementation is in the geographs branch, the weights/experimental module. The ROADMAP.md will need updating...