
Adaptation to sklearn #143

Open · MuellerSeb opened this issue Mar 26, 2020 · 15 comments
@MuellerSeb commented Mar 26, 2020

sklearn provides a lot of functionality that we could use to simplify our code.

Distance matrix calculation:
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise_distances.html

KDtree:
https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KDTree.html

Search radius to optimize moving window kriging (See: #57):
https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KDTree.html#sklearn.neighbors.KDTree.query_radius

Nearest neighbors:
https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KDTree.html#sklearn.neighbors.KDTree.query
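For illustration, a minimal hedged sketch of these building blocks (the coordinates are random stand-ins for real data):

import numpy as np
from sklearn.metrics import pairwise_distances
from sklearn.neighbors import KDTree

rng = np.random.default_rng(0)
coords = rng.random((100, 2))  # (n_samples, n_features)

# full distance matrix between all pairs of points
dists = pairwise_distances(coords, metric="euclidean")

tree = KDTree(coords)
# all neighbors within a search radius (moving-window kriging, see #57)
idx_radius = tree.query_radius(coords[:1], r=0.2)
# distances and indices of the 5 nearest neighbors
dist_nn, idx_nn = tree.query(coords[:1], k=5)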

Besides that, one can specify the metric to use (see #120):
https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.DistanceMetric.html

(columns: identifier | class | arguments | distance formula)

  • “euclidean” | EuclideanDistance | — | sqrt(sum((x - y)^2))
  • “manhattan” | ManhattanDistance | — | sum(|x - y|)
  • “chebyshev” | ChebyshevDistance | — | max(|x - y|)
  • “minkowski” | MinkowskiDistance | p | sum(|x - y|^p)^(1/p)
  • “wminkowski” | WMinkowskiDistance | p, w | sum(|w * (x - y)|^p)^(1/p)
  • “seuclidean” | SEuclideanDistance | V | sqrt(sum((x - y)^2 / V))
  • “mahalanobis” | MahalanobisDistance | V or VI | sqrt((x - y)' V^-1 (x - y))

Or for geo-coordinates (see: #121):

  • “haversine” | HaversineDistance | — | 2 arcsin(sqrt(sin^2(0.5·dx) + cos(x1)·cos(x2)·sin^2(0.5·dy)))
    (coordinates have to be converted to radians first, e.g. with np.deg2rad)
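A minimal hedged sketch of using the haversine metric (note it is available via BallTree rather than KDTree; the Berlin/Munich coordinates are purely illustrative):

import numpy as np
from sklearn.neighbors import BallTree  # the haversine metric is not supported by KDTree

# (lat, lon) in degrees: roughly Berlin and Munich, purely illustrative
latlon = np.deg2rad([[52.52, 13.40], [48.14, 11.58]])
tree = BallTree(latlon, metric="haversine")
dist_rad, idx = tree.query(latlon[:1], k=2)
dist_km = dist_rad * 6371.0  # scale by the Earth's mean radius to get kilometers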

So this could solve a lot of issues.

@rth commented Mar 26, 2020

Yes, I think that would be a good idea. I'm all for better integration with scikit-learn in general (#31).

When improving the API for 2.0, making it compatible with scikit-learn where possible would be useful: https://scikit-learn.org/stable/developers/develop.html#rolling-your-own-estimator

@MuellerSeb commented:

@rth Could you draft a general workflow for the integration with sklearn? I have to learn a bit more about sklearn to get to know the dos and don'ts.

@rth commented Mar 26, 2020

Essentially one could start with minimal functionality along the lines of,

from sklearn.base import BaseEstimator

class OrdinaryKriging(BaseEstimator):
    def __init__(self):  # parameters (variogram model, etc.) would go here
        ...

    def fit(self, X, y):
        ...
        return self

    def predict(self, X, return_std=False):
        ...

where X is an (n_samples, n_features) array (with only n_features = 2 or 3 supported for now).

Looking at the usage of GaussianProcessRegressor in scikit-learn could be a start. Then one would progressively add features, and from time to time run check_estimator to make sure the API is compatible.
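For reference, a minimal hedged sketch of such a check (check_estimator is the real sklearn utility; OrdinaryKriging is the hypothetical skeleton from above):

from sklearn.utils.estimator_checks import check_estimator

# runs the scikit-learn API compliance test suite against the estimator
check_estimator(OrdinaryKriging())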

We already have a scikit-learn wrapper class BTW (class Krige(RegressorMixin, BaseEstimator)), but the idea could be to make this more of a default API.

@MuellerSeb commented:

So we should have a look at:
https://github.com/scikit-learn-contrib/project-template

@rth commented Mar 26, 2020

> So we should have a look at: scikit-learn-contrib/project-template

Worth looking into, but some things there are outdated as well.

@MuellerSeb commented:

OK. In the end I am thinking of a single class for kriging where the dimension is just a parameter. And since ordinary kriging is just universal kriging with a constant drift (or simply an unbiased estimator), we could combine all 4 kriging classes that are present at the moment.
When providing these in the proposed sklearn way, we could even get rid of the RK class, which sounds nice to me.

You once mentioned that anisotropy and rotation could be provided by pipelines (#138).
How does that work?

@rth commented Mar 27, 2020

> OK. In the end I am thinking of a single class for kriging where the dimension is just a parameter. And since ordinary kriging is just universal kriging with a constant drift (or simply an unbiased estimator), we could combine all 4 kriging classes that are present at the moment.
> When providing these in the proposed sklearn way, we could even get rid of the RK class, which sounds nice to me.

Yeah, the implementation could be a single class for kriging. The dimension could just be determined from X.shape[-1] passed to the fit method. Though having OrdinaryKriging / UniversalKriging aliases still might not hurt (if only so that people find them when searching for those keywords online). One of the current issues is that the classes have too many parameters.

> You once mentioned that anisotropy and rotation could be provided by pipelines (#138). How does that work?

The idea is that if some of those calculations are stand-alone transformations of coordinates (i.e. X of shape (n_samples, n_features), with features = ['x', 'y'], converted to an X_transformed of the same shape), then they could become a stand-alone pre-processing class, and a pipeline could be built using make_pipeline from scikit-learn (see the docstring example in the link). This makes things more modular. For instance,

from sklearn.pipeline import make_pipeline

pipe = make_pipeline(Preprocessor(), OrdinaryKriging())
pipe.fit(coords, target)
z = pipe.predict(coords_grid)

where internally, pipe.fit would first call preprocessor.fit, then preprocessor.transform to get the transformed coordinates, and finally kriging.fit on these coordinates.
Then pipe.predict would first call preprocessor.transform and then kriging.predict on the transformed coordinates.

This logic is built into scikit-learn pipelines, and it's possible to chain multiple processing steps like this, and also to combine estimators from different packages, assuming they follow the scikit-learn API.

I haven't done a detailed analysis to see to what extent this is possible in pykrige. One limitation of scikit-learn pipelines is that the target variable (previously known as z) cannot be transformed in pipeline steps. But there are workarounds if that's really a blocker.
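One such workaround in scikit-learn itself is TransformedTargetRegressor; a hedged sketch (OrdinaryKriging again being the hypothetical estimator from above):

import numpy as np
from sklearn.compose import TransformedTargetRegressor

reg = TransformedTargetRegressor(
    regressor=OrdinaryKriging(),  # hypothetical estimator from the snippet above
    func=np.log1p,                # transform the target before fitting
    inverse_func=np.expm1,        # back-transform the predictions
)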

@MuellerSeb commented:

One question I have about the pre-processor: what should be fitted?
The only thing we want is to transform the coordinates with the given rotation angles and anisotropy ratios (we can't fit these, since we would need to fit a spatial variogram to the data to achieve that, which is quite hard for a sparse data set).

So we would only use transform with fixed parameters (angles, ratios), right?

@rth commented Mar 28, 2020

> One question I have about the pre-processor: what should be fitted?
> The only thing we want is to transform the coordinates with the given rotation angles and anisotropy ratios (we can't fit these, since we would need to fit a spatial variogram to the data to achieve that, which is quite hard for a sparse data set).

If fit does nothing, that's fine, but it still needs to be there for API compatibility,

from sklearn.base import BaseEstimator, TransformerMixin

class Preprocessor(BaseEstimator, TransformerMixin):  # or give it a better name
    def __init__(self, parameter="some_default"):
        self.parameter = parameter

    def fit(self, X, y=None):
        """This estimator is stateless, fit does nothing."""
        return self

    def transform(self, X):
        # actual transformation here
        return X_transformed

The base classes are necessary because,

  • BaseEstimator marks it as compatible with the scikit-learn API and also defines the set_params / get_params methods to set/get the parameters passed in __init__,
  • TransformerMixin marks it as a transformer, and defines,

def fit_transform(self, X, y=None):
    return self.fit(X, y).transform(X)

which in this case is equivalent to just calling transform.
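For the concrete anisotropy/rotation use case discussed here, such a stateless transformer could look like the following hedged sketch (class name and parametrization are made up for illustration, not PyKrige's actual API):

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class CoordinateRotator(BaseEstimator, TransformerMixin):
    def __init__(self, angle=0.0, ratio=1.0):
        self.angle = angle  # rotation angle in radians
        self.ratio = ratio  # anisotropy ratio applied to the rotated second axis

    def fit(self, X, y=None):
        # stateless: angle and ratio are fixed, nothing is learned from the data
        return self

    def transform(self, X):
        c, s = np.cos(self.angle), np.sin(self.angle)
        rot = np.array([[c, -s], [s, c]])
        X_rot = np.asarray(X, dtype=float) @ rot.T
        X_rot[:, 1] *= self.ratio  # stretch/squeeze the rotated y-axis
        return X_rot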

@MuellerSeb commented Apr 6, 2020

I was digging through the sklearn code and it seems that you can't use the KDTree directly in Cython (link). Since it would be good to avoid the Python overhead of creating the distance matrix, I would suggest forking the critical parts of sklearn.neighbors (link):

  • _binary_tree.pxi - base class for the KDTree
  • _kd_tree.pyx - the KDTree itself
  • _dist_metrics.pyx - the provided distance metrics
  • _dist_metrics.pxd - header file for the distance metrics
  • _typedefs.pyx - type definitions
  • _typedefs.pxd - header file for the type definitions

The only thing that needs to be modified is commenting out from ..utils import check_array, and all occurrences of check_array in _binary_tree.pxi (4 times), since we can take care of the input ourselves (it needs to be an np.float64 array). Then the 6 files above should be enough. We only need to create a header file _kd_tree.pxd (which is not present in sklearn, that's why the effort) to use the KDTree directly in Cython.

Advantages:

  • we don't need to create the full distance matrix in Python (which can be huge for big data sets of ~10000 points)
  • when implementing the moving window, we can then simply (see the sketch below):
    1. loop over the points in Cython
    2. get all nearest neighbors
    3. create the kriging matrix directly in Cython
    4. solve the kriging equation in Cython (even parallelizable with prange and OpenMP)

ATM the kriging matrix is created in Python and passed to Cython (when doing the full kriging). With n nearest neighbors, a list of points is passed to Cython to cut out the reduced kriging matrix (again, the full matrix is present).

The idea would be to create a Cython companion class for the future kriging class, which holds all necessary information and does the number-crunching. All we need is the variogram in Cython, which is already present.
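As a Python-level illustration of the moving-window steps listed above (the proposal is to do this loop in Cython; moving_window_krige and the variogram argument are hypothetical placeholders, with variogram standing in for a semivariogram function):

import numpy as np
from sklearn.neighbors import KDTree

def moving_window_krige(coords, values, grid, radius, variogram):
    tree = KDTree(coords)
    # 1./2. for every target point, find all data points within the search radius
    neighbors = tree.query_radius(grid, r=radius)
    est = np.full(len(grid), np.nan)
    for i, idx in enumerate(neighbors):
        n = len(idx)
        if n == 0:
            continue  # no data inside this window
        pts, vals = coords[idx], values[idx]
        # 3. local ordinary kriging matrix with the unbiasedness row/column
        a = np.ones((n + 1, n + 1))
        a[:n, :n] = variogram(np.linalg.norm(pts[:, None] - pts[None], axis=-1))
        a[-1, -1] = 0.0
        b = np.ones(n + 1)
        b[:n] = variogram(np.linalg.norm(pts - grid[i], axis=-1))
        # 4. solve the local kriging system and estimate the target value
        w = np.linalg.solve(a, b)
        est[i] = w[:n] @ vals
    return est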

We could also cut down the functionality of the KDTree to only provide:

  • query_radius - for the moving window
  • query - for the n nearest neighbors (which is provided by pykrige ATM)

So we can keep the external code base at a minimum.

@rth, @bsmurphy what do you think?

@rth commented Apr 7, 2020

Do you think there is no way of doing this without using KDTree/BallTree from Cython? E.g. doing the computation in small batches or using sparse distance matrices. See e.g. KNeighborsTransformer and the approximate nearest neighbors example.
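A hedged sketch of that sparse-graph approach (random coordinates, parameters purely illustrative):

import numpy as np
from sklearn.neighbors import KNeighborsTransformer

coords = np.random.default_rng(0).random((1000, 2))
# sparse (CSR) matrix holding, per point, only the distances to its 10 nearest neighbors
graph = KNeighborsTransformer(n_neighbors=10, mode="distance").fit_transform(coords)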

> I would suggest forking the critical parts of sklearn.neighbors

That code is legacy and quite convoluted (cf. e.g. scikit-learn/scikit-learn#4217); I would not recommend forking (and maintaining) it unless there is no way around it.

> The only thing that needs to be modified is commenting out from ..utils import check_array, and all occurrences of check_array in _binary_tree.pxi (4 times), since we can take care of the input ourselves (it needs to be an np.float64 array).

There is a SLEP 10 proposal to replace check_array with a _validate_data method in scikit-learn (which can be overridden in contrib projects). It was partially merged (scikit-learn/scikit-learn#16112) and will be available in the next release (in ~1 month or so). If some estimators are not yet done, it should be fairly easy to add them.
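For context, a hedged sketch of that pattern (the _validate_data method only exists once SLEP 10 lands, so the exact name/signature depends on the scikit-learn version):

import numpy as np
from sklearn.base import BaseEstimator

class MyEstimator(BaseEstimator):
    def fit(self, X, y=None):
        # replaces a direct check_array call and can be overridden downstream
        X = self._validate_data(X, dtype=np.float64)
        return self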

@MuellerSeb commented:

If we can't use the KDTree in Cython, the loop over the grid points needs to be done in Python, or the full distance matrix needs to be constructed and passed to Cython to do the loop there. In the first case we get a massive overhead in time, in the second case a massive overhead in memory consumption.

It seems that the sklearn team doesn't want to expose the KDTree to Cython, since they think it is too much of a maintenance overhead.

I'll think about it.

@rth commented Apr 7, 2020

> or the full distance matrix needs to be constructed and passed to Cython to do the loop there [...] in the second case a massive overhead in memory consumption

Not if we use a sparse distance matrix (one that only stores the K nearest neighbors for each point).

> It seems that the sklearn team doesn't want to expose the KDTree to Cython, since they think it is too much of a maintenance overhead.

You could try opening an issue about it there, but I do anticipate mostly negative feedback, indeed due to the maintenance costs.

@MuellerSeb commented:

If we only use the K nearest neighbors, we still can't provide a search radius for the moving window: which points count as the K nearest always depends on the density of the input data.

@rth commented Apr 7, 2020

> If we only use the K nearest neighbors, we still can't provide a search radius for the moving window: which points count as the K nearest always depends on the density of the input data.

I think it should also be possible to create a sparse distance matrix for the neighbors within a given radius, i.e. a KNeighborsTransformer-like class that uses radius_neighbors_graph internally instead of kneighbors_graph from NearestNeighbors. But I guess not both of those at the same time.
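A hedged sketch of that radius-based variant (again with made-up coordinates and radius):

import numpy as np
from sklearn.neighbors import radius_neighbors_graph

coords = np.random.default_rng(0).random((1000, 2))
# sparse CSR matrix: row i holds the distances to all points within 0.1 of point i
graph = radius_neighbors_graph(coords, radius=0.1, mode="distance")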
