
SMEP: redesign of the KDE modules

Status: Design

Purpose

The current design of the KDE module is not flexible enough to allow for plugging in alternative algorithms, for instance to provide boundary correction.

The purpose of this document is to provide a structure for the KDE module that allows for custom algorithms and easier experimentation later on. It should also allow the integration of the algorithms currently implemented in the kde module of the pyqt_fit package, available on PyPI.

The design should also allow for an efficient implementation and should keep the principles currently shared among the various fitting methods offered by statsmodels. In particular, it seems that statsmodels has a multi-process job mechanism, and it should be possible to use it.

To Discuss: In the current version of statsmodels, the parametrization of algorithms using strings and dispatch tables is fairly ubiquitous. I would suggest using a more Pythonic approach and simply using objects with a well-defined API for all parts (see the sketch after this list). To my mind, this has three advantages:

  1. this will force us to define and publish the interface, which is good for documentation
  2. this will make it easy for users to provide their own version of the algorithms if they want to
  3. IDE/shells can use objects more readily than strings for completion or for presenting help
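
A minimal sketch of the contrast, assuming a hypothetical kde.KDE constructor taking a kernel argument (the Gaussian class here is illustrative, not an existing statsmodels object):

import numpy as np

class Gaussian:
    """A kernel as a plain object with a published interface."""
    def pdf(self, x):
        return np.exp(-0.5 * np.asarray(x) ** 2) / np.sqrt(2 * np.pi)

# Current style: a string, resolved through a dispatch table inside the model.
#   model = kde.KDE(X, kernel="gau")
# Proposed style: any object implementing the kernel interface, so users can
# pass their own kernels and IDEs can introspect and complete on them.
#   model = kde.KDE(X, kernel=Gaussian())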

Comments (josef-pkt): We use classes or instances as arguments heavily in other models, mainly RLM, GLM and GEE. In some cases for distributions, I (or we in scipy) use both, where the user can specify either a string or a callable. (I just used dispatch based on string arguments again for the covariance choices in the models, and didn't think about allowing functions or classes.)

I don't see any problem using classes as arguments for the kernels, but for the bandwidth choices I don't know whether the cross-validation, least-squares and maximum-likelihood methods are too tightly integrated into the models to allow for bandwidth classes or functions that can be easily swapped.

Comments (PierreBdR): I will have a look at these methods. As I understand it (before examining the code itself), the cross-validation might need to fit the model, maybe more than once. But I don't see that as a real problem.

Main usage of the module

As I see it, a simple session should look like this:

>>> import numpy as np
>>> from statsmodels.nonparametric import kde
>>> X = np.exp(np.random.randn(10**6))  # Log-normal distribution
>>> model = kde.KDE(X)
>>> est = model.fit()
>>> xs = np.r_[0:10:1024j]
>>> ys = est(xs) # Estimate the KDE on a defined set of points
>>> Ys = est.cdf(xs) # Estimate the cumulative distribution function
>>> xs, ys = est.grid(2**20)  # Estimate the PDF on a grid so as to be as efficient as possible
>>> _, Ys = est.cdf_grid(2**20) # Estimate the CDF on a grid -- xs should be the same as for the PDF

If we want to specify some particular estimation method, for example assuming we have an object log_transformed that would estimate the distribution in log-space:

>>> model.lower = 0  # Set the lower bound to 0 (otherwise the log doesn't make sense)
>>> model.method = log_transformed
>>> est = model.fit()
>>> xs, ys = est.grid(2**20)  # The grid is likely to be non-uniform

Comment (josef-pkt): In general we avoid setting attributes to influence the behavior. The main problem, which we often have a hard time avoiding, is "stale state": when an option has been changed, we would need to keep track of which attributes are no longer appropriate. Our general solution was to put all options that influence the "state" of the model into the __init__; if users want to change options, they need a new instance, which can hold a different "state".

Comment (PierreBdR): As I see it, the KDE class is nothing more than a place to carefully set up all the parameters you need. The fit method examines them all and creates an appropriate estimator object. The estimator itself is read-only, so there are no options that could corrupt the data. I like the attribute-setting paradigm, which avoids function calls with dozens of arguments and also allows helper functions that set parameters consistently for a given purpose.

Comment (josef-pkt): It depends on whether attribute setting can mess up the state. For example, in analogy to our other models, if the FittedKDE class holds on to the underlying KDE class and needs to get information from it, then a user setting an attribute after fit might cause incorrect "state": FittedKDE would no longer access the information that was used in its fit, but instead some arbitrarily changed information. Avoiding attribute setting makes things easier for users, because they don't have to learn which attributes can be set at which time and which changes would make the results inconsistent. If the FittedKDE class does not hold on to the underlying KDE class, then there is no "state", i.e. no attributes that could become inconsistent. (Aside: I spent many, many hours hunting bugs based on some incorrectly changed or stale state, both in scipy.stats.distributions and statsmodels, so I'm very conservative and cautious about this. "Defensive programming".)

Fitted parameters can then be accessed as attributes:

>>> print("Bandwidth = ", est.bandwidth)
>>> print("method = ", est.method) # Method really used, may be different than the one selected!

Comment (Padarn) I like PierreBdR's design concept, but I do think it depends on how users are planning on using the KDE classes. This way of setting up KDE objects makes a lot of sense if multiple instances are created, otherwise it seems somewhat unnecessary.

If we do it this way, we will want to be careful that simple changes to the KDE (changing the bandwidth manually, for example) are efficient, and that the KDE class can reuse the previous solution stored in a FittedKDE.

Comment (PierreBdR): I think the FittedKDE object shouldn't hold a reference to the original KDE object, but rather a shallow copy of what it needs. This raises the question of deep copy of the arguments at the level of the KDE, but we'll get to that later. As for efficient modification of the parameters, I will offer a first implementation in which the method gets to decide what can be efficiently changed. This might be dangerous, but what can be changed depends on what is pre-computed during fitting. So I will try, and we'll see whether that's a good thing or not.
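
As a minimal sketch of that snapshot idea (the class layout and attribute names are illustrative, not a committed API):

import numpy as np

class FittedKDE:
    """Read-only result object: holds its own copy of the parameters."""
    def __init__(self, data, lower, method):
        self._data = data
        self._lower = lower
        self._method = method

class KDE:
    """Parameters are plain attributes, examined only when fit() is called."""
    def __init__(self, data):
        self.data = np.asarray(data, dtype=float)
        self.lower = -np.inf
        self.method = None

    def fit(self):
        # Snapshot the parameters: the FittedKDE keeps no reference to
        # this KDE object, so there is no stale state to worry about.
        return FittedKDE(self.data.copy(), self.lower, self.method)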

Comment (Padarn): I'm having a hard time envisioning a use case that makes this extra layer of complexity necessary, personally, but my use of the density estimates has been fairly basic, to be honest. I will be interested to see how what you're describing looks.

Functions that are needed

Of course, we should at least offer the same functions as are available right now. These include:

  • selection of the bandwidth algorithm
  • selection of the kernel used
  • estimation of the CDF, cumulative hazard function, entropy, inverse CDF and survival function
  • allow weighting of the dataset
  • point evaluation (i.e. on points defined by the user)
  • adjustment of the bandwidth

I would like to add what is handled by pyqt_fit, namely:

  • local adjustment of the bandwidth (i.e. each point may have a different bandwidth)
  • choice of the grid size for efficient evaluation at evaluation time
  • definition of the boundaries of the domain
  • select the method used to evaluate the PDF

Of course, we should strive for as efficient an implementation as possible. This is why I would like the fit method to return a new object, so we have the option to switch methods based on data properties. That would allow for a single KDE class that selects univariate algorithms when appropriate, as sketched below.
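
A rough sketch of that dispatch, with hypothetical estimator class names:

import numpy as np

class FittedKDE1D:
    """Univariate estimator: can use the fast FFT/DCT-based algorithms."""
    def __init__(self, data):
        self.data = data

class FittedKDEnD:
    """Generic multivariate estimator."""
    def __init__(self, data):
        self.data = data

class KDE:
    def __init__(self, data):
        self.data = np.asarray(data, dtype=float)

    def fit(self):
        # fit() returns a *new* object, so it is free to pick a
        # specialized implementation based on the data's properties.
        if self.data.ndim == 1 or self.data.shape[-1] == 1:
            return FittedKDE1D(self.data.ravel())
        return FittedKDEnD(self.data)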

Design of the modules

Note: I understand that the design I am suggesting is heavily influenced by that of pyqt_fit. If you disagree with it, don't hesitate to discuss it.

The main module would still be nonparametric.kde. It should contain the KDE class that is responsible for gathering all the data needed to perform the fitting. This means:

  1. Definition of the domain supporting the density (e.g. lower and upper bounds)
  2. The kernel to use
  3. The bandwidth estimation method or value
  4. The density estimation method

Fitting the KDE object will provide a FittedKDE object that forwards the calls to the estimation method, in order to provide a unified interface both to the user and to the method. This means that users will be able to use flexible notations (i.e. single numbers, lists, numpy arrays, ...) while the method will be guaranteed to receive valid, indexable ndarrays for everything (see the sketch below). There might be more than one FittedKDE class. For example, we might want to specialize the class for univariate estimation.
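
A sketch of that proxy role (hypothetical names): the user-facing call normalizes the input, so the method implementation always sees a well-formed ndarray.

import numpy as np

class FittedKDE:
    def __init__(self, data, method):
        self._data = data
        self._method = method

    def pdf(self, points):
        # Accept flexible input: a scalar, a list, or an ndarray...
        points = np.atleast_1d(np.asarray(points, dtype=float))
        # ...and forward a guaranteed valid, indexable array to the method.
        return self._method.pdf(self, points)

    # est(xs) as shorthand for est.pdf(xs), as in the session above
    __call__ = pdf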

Comment (josef-pkt): We still additionally have the multivariate KDE, with a different emphasis, so naming and structure should take this into account. Plus, it would be possible to add a special 2D KDE that could still efficiently use the FFT.

Comment (Padarn) With regards to the "density estimation methods", do you have some standard references? I've come across these transformation methods plenty, but only in scattered papers. I haven't got a good solid reference that collects them together.

Comment (PierreBdR): No, I don't have a reference. But the maths behind them aren't really complicated, and they're all here: http://pyqt-fit.readthedocs.org/en/latest/mod_kde_methods.html#pyqt_fit.kde_methods.TransformKDE1DMethod

Note: We have to be careful about how we define the fitting. For example, the bandwidth estimation may need to be done not on the original data, but on some transform of the data, as in the sketch below.
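
A sketch of that ordering for a transform-based method (fit_transformed, transform and bandwidth_estimator are placeholders, not proposed names):

import numpy as np

def fit_transformed(data, bandwidth_estimator, transform=np.log):
    """Sketch of the fitting order for a transformed KDE."""
    # The bandwidth must be estimated on the *transformed* data, not the
    # original sample: a bandwidth chosen in the original space would be wrong.
    z = transform(np.asarray(data, dtype=float))
    bw = bandwidth_estimator(z)
    return z, bw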

Kernel module: nonparametric.kernels

The kernels module should provide simple kernels of fixed variance and mean. I think we shouldn't be shy about adding methods to the kernels, as long as we can provide default numerical estimations based on the density itself. For example, in pyqt_fit, the univariate kernels are only required to provide the PDF of the kernel, but may also provide:

  • the upper and lower bound for the kernel
  • an efficient cutting value for this kernel, i.e. a value c such that cdf(c) - cdf(-c) ~ 1
  • the cumulative distribution function
  • the first partial moment
  • the second partial moment
  • the FFT of the kernel
  • the DCT of the kernel

But by inheriting from the Kernel1D class, all of these methods except the bounds and the cutting value are estimated numerically from the PDF.
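
A minimal sketch of how such a base class works (the Kernel1D name comes from pyqt_fit; the method bodies shown here are illustrative):

import numpy as np
from scipy import integrate

class Kernel1D:
    """Subclasses must provide pdf(); everything else has a numerical default."""
    lower = -np.inf  # support bounds, may be overridden
    upper = np.inf
    cut = 3.0        # c such that cdf(c) - cdf(-c) ~ 1

    def pdf(self, x):
        raise NotImplementedError

    def cdf(self, x):
        # Slow but generic default: integrate the pdf numerically.
        # Subclasses should override this with a closed form if they have one.
        lo = max(self.lower, -10.0 * self.cut)
        return np.array([integrate.quad(self.pdf, lo, xi)[0]
                         for xi in np.atleast_1d(x)])

class Gaussian(Kernel1D):
    def pdf(self, x):
        return np.exp(-0.5 * np.asarray(x) ** 2) / np.sqrt(2 * np.pi)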

Note: In other places in statsmodels, kernels also have their order set.

Bandwidth estimation: nonparametric.bandwidths

The bandwidth estimation should be a simple object (or function) taking the data, the domain and the kernel. Of course, the bandwidth estimator may have parameters of its own.
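
For example, a normal-reference rule of thumb fits that interface naturally (a sketch; the signature is a proposal, not existing statsmodels code):

import numpy as np

def normal_reference(data, domain=None, kernel=None):
    """Normal-reference rule of thumb: h = 1.06 * sigma * n**(-1/5).
    domain and kernel are accepted but unused here, so every estimator
    can be called through the same interface."""
    data = np.asarray(data, dtype=float)
    return 1.06 * data.std(ddof=1) * len(data) ** (-0.2)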

Note: I cannot think of anything else that is needed, but I don't have much experience with these.

Density estimation method: nonparametric.kde_methods

This is where the magic really happens. To simplify the code, the methods can assume they will always be called through a proxy, and will therefore always receive well-formed arrays. Other than that, they should implement at least the PDF method; the rest can be estimated numerically. I did some work to implement at least the same things as statsmodels, and currently my methods offer:

  • pdf
  • cdf
  • inverse cdf
  • survival function
  • inverse survival function
  • hazard function
  • cumulative hazard function

All these methods exist in two flavours: by points or by grid. Also, the base class for estimation provides a numerical approximation for all these methods, so it is sufficient to implement point-wise estimation of the PDF to obtain all the other functions (although it may be slow).
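
A sketch of that fallback mechanism (hypothetical names; est is assumed to carry finite domain bounds as lower and upper):

import numpy as np

class KDEMethod:
    """Only pdf() is required; the rest defaults to numerical approximations."""

    def pdf(self, est, points):
        raise NotImplementedError

    def cdf_grid(self, est, N=2 ** 10):
        # Generic (and possibly slow) default: trapezoidal integration
        # of the pdf over a grid on the estimation domain.
        xs = np.linspace(est.lower, est.upper, N)
        ys = self.pdf(est, xs)
        cdf = np.concatenate(([0.0],
                              np.cumsum(np.diff(xs) * (ys[1:] + ys[:-1]) / 2)))
        return xs, cdf

    def sf_grid(self, est, N=2 ** 10):
        # Survival function from the cdf: sf = 1 - cdf
        xs, cdf = self.cdf_grid(est, N)
        return xs, 1.0 - cdf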

A word on grids and binning

The grid-based estimation methods can often make use of a good binning method. As such, it will be profitable to provide a separate binning class that the methods can access when they need it. The class should take a series of points and weights, the boundaries, the number of bins, and whether the domain is simply bounded or cyclic. It should return a table of bins and, for each bin, the position of its centre.
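
A pure-Python sketch of linear binning under those requirements (simply bounded domain only; the cyclic case and the class wrapper are omitted, and points are assumed to lie within [lower, upper]):

import numpy as np

def linear_binning(points, weights, lower, upper, n_bins):
    """Split each point's weight linearly between the two bin centres
    that bracket it; returns the bin centres and the binned weights."""
    points = np.asarray(points, dtype=float)
    weights = np.asarray(weights, dtype=float)
    centres = np.linspace(lower, upper, n_bins)
    delta = centres[1] - centres[0]
    pos = (points - lower) / delta                    # fractional grid position
    left = np.clip(np.floor(pos).astype(int), 0, n_bins - 2)
    frac = pos - left                                 # distance past the left centre
    bins = np.zeros(n_bins)
    np.add.at(bins, left, weights * (1.0 - frac))     # unbuffered accumulation
    np.add.at(bins, left + 1, weights * frac)
    return centres, bins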

Caveats

We need to be extra careful not to preclude future algorithms, in particular for the bandwidth calculation, and for fancy estimation methods that might transform the data even before the bandwidth is estimated.

In particular, we want to be able to implement:

  • TransformKDE -- The KDE is computed in a different space
  • Botev et al. 2010 [1] -- a different KDE that, in the end, computes a bandwidth for each point: http://arxiv.org/pdf/1011.2602.pdf
  • Cross-validation estimator, that needs the KDE estimator

Copy of the data

There is an important question about deep vs. shallow copy of the data. Obviously, shallow copy is much more efficient, but deep copy is a lot safer. For now, I will use shallow copy whenever possible. I'll make sure the copy is always explicit, so it can be changed later (e.g. using property setters).
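
A sketch of keeping the copy explicit behind a property setter, so switching from shallow to deep copy later is a one-line change:

import numpy as np

class KDE:
    def __init__(self, data):
        self.data = data  # goes through the setter below

    @property
    def data(self):
        return self._data

    @data.setter
    def data(self, value):
        # Shallow for now: np.asarray avoids a copy when the input is
        # already a float ndarray. Switching to np.array(value, copy=True)
        # here would make every assignment a deep copy.
        self._data = np.asarray(value, dtype=float)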

Shape of multi-dimensional arrays

Most of scipy's statistical functions and statsmodels use Fortran ordering for arrays, which means the last index is the one selecting the variable.

There are two significant exceptions: statsmodels.nonparametric.KernelReg and scipy.stats.gaussian_kde. The former should probably be corrected, the latter reported.

For example, to estimate the pdf on points (0,0), (-1,-2) and (2,3) you would use:

>>> est.pdf([[0,0], [-1,-2], [2,3]])

Using the last index for the variable allows the first indexes to shape the output. For example, in a 2D problem, one could submit an 11x11x2 array as a grid and receive an 11x11 array as the answer, keeping the grid shape, with code like:

>>> est.pdf(np.transpose(np.meshgrid(np.r_[0:10:11j], np.r_[0:10:11j])))

Existing code

statsmodels

In statsmodels, the current version of the estimator is in the nonparametric module. In particular:

  • the kde module contains the main KDEUnivariate object, acting both as the interface and as the implementation of some functions. It also contains the estimation functions, in particular the point-wise and FFT-based ones.
  • the kernels module contains the various kernels. They are enumerated in the kde module in a dispatch table matching strings to objects.
  • the linbin module defines the linear binning method
  • the kde_methods module defines a set of helpers
  • the bandwidths module defines bandwidth estimators, including a dispatch table matching strings to objects.

pyqt_fit

I am obviously more familiar with this package. Currently, the KDE object also works as a proxy for the method. The organization is such that:

  • the KDE1D object is in the kde module
  • the kernels are in the kernels module
  • the bandwidth selection methods are in the kde_bandwidth module
  • the density estimation methods are in the kde_methods module

There are also some helpers:

  • _kernels.pyx contains a Cython implementation of the kernel functions, while _kernel_py.py has the pure-python implementation
  • cy_binning.pyx contains the Cython implementation of linear binning, adapted from the statsmodels code itself, while py_binning.py contains a simpler, pure-python, binning method.
  • math.pxd contains definitions of libc math functions that I needed and which were not exported by default

Performance

Here is a gist with a performance test: https://gist.github.com/PierreBdR/90e7f6dd72de13dcf9ee

On my computer I obtain the following output:

n= 10000000.0 average time v1: 5.46428132057 seconds
n= 10000000.0 average time v2: 2.3874446551 seconds
n= 10000000.0 average time v3 (fft): 2.00666666031 seconds
n= 10000000.0 average time v4 (dct): 2.16459933917 seconds
bw1 = 0.0421814365282 -- bw2 = 0.0421814365282 -- bw3 = 0.0421814365282
x -> [-5.3790708367 ; 5.75612642225]
Absolute error iy1 -- y2 ~ 2.23230792331e-12
Relative error iy1 -- y2 ~ 4091.7609555

Absolute error iy3 -- y2 ~ 1.98063444022e-13
Relative error iy3 -- y2 ~ 9.26272873809e-06

Absolute error iy4 -- y2 ~ 6.52985430083e-15
Relative error iy4 -- y2 ~ 9.26148008014e-06

As such, the implementation of pyqt_fit should be an improvement in terms of speed, whether the FFT or the DCT is used (unless I missed something). It is worth noting that this performance has been achieved thanks to the efficient Cython binning I wrote based on @Padarn's statsmodels implementation.

Documentation and tutorial

The existing documentation in statsmodels is pretty minimal. I offer to adapt the documentation and tutorial I wrote. The tutorial is here: http://pyqt-fit.readthedocs.org/en/latest/KDE_tut.html and the module reference pages are on the same site.

References

[1] Z. I. Botev, J. F. Grotowski, and D. P. Kroese. Kernel density estimation via diffusion. The Annals of Statistics, 38(5):2916-2957, 2010.
