ArnoVel/structure-identification
Probing the Structure of Bivariate Distributions

The goal of this repository is to compile and compare statistical methods and algorithms that take bivariate data (point clouds) as input and attempt to infer a causal direction.

Good references on this topic are:

  • A very comprehensive benchmark of methods based on Additive Noise Models, together with all the surrounding concepts
  • Several machine-learning algorithms using distribution embeddings have been designed, e.g. RCC and KCDC. A more statistical approach is QCDC (copulas + quantile scores)
  • The SLOPE algorithm is a framework that assumes a set of basis functions and iteratively trades off goodness of fit against function complexity to find the "best" model. Various instantiations exist, such as Slope-S, Slope-D, an identifiable variant, etc. More information can be found in their journal paper
  • RECI is a statistical approach based on regression, identifiable in the low-noise setting
  • IGCI justifies a statistical approach in the case where the relationship is deterministic and invertible. Additional material can be found in their subsequent paper.
  • A good review of graphical models for more than two variables can also help in understanding the general point of view.
  • CGNN connects graphical models, generative models, and bivariate methods in an interpretable fashion (using neural networks). It is a good bridge between bivariate and graph methods. The authors are currently building a very helpful Python causal discovery library

Dependence / Independence Measures

Many causal algorithms rely on independence tests and similarity tests. Some examples are:

  • Bivariate methods using Additive Noise Models often use Mutual Information or HSIC
  • Constraint-based methods for graph data use conditional independence tests. A good statistical test is the KCI test, together with the related KPC algorithm. When a faster, approximate method is needed, the authors (and others) have recently designed approximations such as RCIT and RCOT. Another good, but quadratic-complexity, conditional independence test is PCIT
  • A good review of dependence tests can be found in this interesting thesis
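As a concrete toy example of the constraint-based recipe (not one of the kernel tests above, and not code from this repo), a Fisher-z test of partial correlation illustrates the idea: residualise both variables on the conditioning set, then test the correlation of the residuals. All names below are illustrative:

```python
import numpy as np
from math import erfc, log, sqrt

def partial_corr_pvalue(x, y, z=None):
    """Fisher-z test of (partial) correlation; H0: corr(x, y | z) = 0.
    Two-sided p-value; z is an optional conditioning set (n, k)."""
    n = len(x)
    k = 0
    if z is not None:
        z = z.reshape(n, -1)
        k = z.shape[1]
        Z = np.column_stack([np.ones(n), z])
        # residualise x and y on z via least squares
        x = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
        y = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    r = np.corrcoef(x, y)[0, 1]
    fz = 0.5 * log((1 + r) / (1 - r))        # Fisher's z-transform
    stat = sqrt(n - k - 3) * abs(fz)         # approximately standard normal under H0
    return erfc(stat / sqrt(2))
```

On a linear-Gaussian chain x → z → y, this test finds x and y marginally dependent but (approximately) independent given z, which is exactly the pattern constraint-based methods exploit.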

Here we are interested in differentiable versions of various statistical tests. We implemented several tests in PyTorch, using smooth approximations to existing tests, allowing backpropagation with respect to all inputs and parameters.
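For reference, here is a minimal NumPy sketch of the biased HSIC statistic (Gaussian kernels, median heuristic) that the differentiable PyTorch versions build on; every operation below is smooth in the inputs, which is what makes a backprop-ready port possible. The function names are illustrative, not the repo's API:

```python
import numpy as np

def _gaussian_gram(x, bandwidth=None):
    """Gaussian-kernel Gram matrix; median heuristic if no bandwidth given."""
    d2 = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    if bandwidth is None:
        med = np.median(d2[d2 > 0]) if np.any(d2 > 0) else 1.0
        bandwidth = np.sqrt(med / 2)
    return np.exp(-d2 / (2 * bandwidth ** 2))

def hsic_biased(x, y):
    """Biased HSIC estimator: trace(K H L H) / n^2 with centring matrix H."""
    n = x.shape[0]
    K = _gaussian_gram(x.reshape(n, -1))
    L = _gaussian_gram(y.reshape(n, -1))
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(K @ H @ L @ H) / n ** 2
```

The statistic is near zero for independent samples and grows with (possibly nonlinear) dependence, which is why minimizing it over model parameters is a sensible ANM-fitting objective.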

HSIC

  • PyTorch HSIC test and an example of HSIC minimization ( code ) for ANM detection. Although the HSIC statistic is differentiable with respect to all its inputs, our implementation does not yet support hyperparameter fitting.

  • Examples of the 2D Gaussian HSIC-Gamma test and of ANM-detection tests will be uploaded.

  • We might re-implement the relative HSIC between two models
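The ANM-detection idea behind the HSIC-minimization example can be sketched as follows: regress in both directions and keep the direction whose residuals look more independent of the input. This toy version uses a cheap heteroscedasticity proxy, |corr(resid², input²)|, in place of HSIC, and polynomial regression in place of the repo's models; it is an illustration only:

```python
import numpy as np

def _residual_dependence(a, b, degree=5):
    """Fit b = f(a) + resid by polynomial least squares and return a crude
    dependence score between the residuals and the input (HSIC stand-in)."""
    coefs = np.polyfit(a, b, degree)
    resid = b - np.polyval(coefs, a)
    return abs(np.corrcoef(resid ** 2, a ** 2)[0, 1])

def anm_direction(x, y):
    """Prefer the direction with the more input-independent residuals."""
    s_xy = _residual_dependence(x, y)   # score for the model y = f(x) + n
    s_yx = _residual_dependence(y, x)   # score for the model x = g(y) + n
    return ('x->y' if s_xy < s_yx else 'y->x'), s_xy, s_yx
```

For an additive-noise pair such as y = x³ + n, the backward regression has residuals whose spread varies with y, so the backward score is larger and the forward direction wins.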

MMD
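The MMD is used as a similarity test between two samples. As a reference point, the biased estimate of the squared Maximum Mean Discrepancy with a Gaussian kernel takes a few lines of NumPy (fixed bandwidth for simplicity; names are illustrative, not the repo's API):

```python
import numpy as np

def _sqdists(a, b):
    """Pairwise squared Euclidean distances between rows of a and b."""
    return np.sum(a ** 2, 1)[:, None] + np.sum(b ** 2, 1)[None, :] - 2 * a @ b.T

def mmd2_biased(x, y, bandwidth=1.0):
    """Biased estimate of squared MMD with a Gaussian kernel:
    mean(Kxx) + mean(Kyy) - 2 mean(Kxy)."""
    k = lambda a, b: np.exp(-_sqdists(a, b) / (2 * bandwidth ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()
```

The statistic is near zero when the two samples come from the same distribution and grows as the distributions separate; like HSIC above, it is smooth in the inputs, so a differentiable PyTorch port is direct.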

C2ST

Classifier Two-Sample Tests (C2ST) were introduced and evaluated in this paper. Here we re-implement and slightly adapt the authors' Lua code, which includes

  • C2ST-NN: a shallow neural-network classifier (ReLU + sigmoid) with 20 hidden units by default. Although adding layers/hidden units is a good idea, we usually work with 500-5000 samples per distribution and/or aim for an accuracy above 55% to reject P = Q
  • C2ST-KNN: a k-nearest-neighbours classifier with k = floor(n_te / 2). Usually performs worse than the neural network.

The idea, in broad terms, is that under H0 (P = Q) the classifier can do no better than chance, and n_te * acc is distributed as Binomial(n_te, 0.5). The accuracy under H0 can therefore be approximated as Normal(0.5, 0.25/n_te); we use this approximate null to compute a p-value for the observed accuracy and reject H0 accordingly.
Some basic examples can be found in this subdirectory.
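The normal approximation of the null described above translates directly into a p-value computation; a minimal sketch (the function name is ours, not the repo's):

```python
import math

def c2st_pvalue(acc, n_te):
    """One-sided p-value for the observed test accuracy under H0: P = Q,
    using the approximation acc ~ Normal(0.5, 0.25 / n_te)."""
    z = (acc - 0.5) / math.sqrt(0.25 / n_te)
    # P(Normal(0,1) >= z), via the complementary error function
    return 0.5 * math.erfc(z / math.sqrt(2))
```

For example, with n_te = 500 test samples, an accuracy of 55% already gives a p-value of about 0.013, consistent with the ">55% to reject" rule of thumb above.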

Bivariate Causal Algorithms

SLOPE

We are currently re-implementing SLOPE in Python, supporting both NumPy and PyTorch datatypes. An example of the SLOPE fit with 13 basis functions can be found in this folder ( code ), which also contains mixed fits for 8 functions, and a little more.
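As a rough illustration of the fit-versus-complexity trade-off that SLOPE formalises with MDL (this is a BIC-flavoured toy over polynomial bases, not the paper's exact code lengths or this repo's implementation):

```python
import numpy as np

def mdl_style_score(x, y, degree):
    """Two-part score in bits: residual code length (Gaussian, up to
    constants) plus a parameter cost. A stand-in for SLOPE's MDL score."""
    n = len(x)
    coefs = np.polyfit(x, y, degree)
    resid = y - np.polyval(coefs, x)
    mse = np.mean(resid ** 2) + 1e-12
    data_bits = 0.5 * n * np.log2(mse)             # goodness of fit
    model_bits = (degree + 1) * 0.5 * np.log2(n)   # function complexity
    return data_bits + model_bits

def best_degree(x, y, max_degree=6):
    """Pick the basis (here: polynomial degree) with the lowest total score."""
    return min(range(1, max_degree + 1), key=lambda d: mdl_style_score(x, y, d))
```

The parameter cost penalises each extra basis function, so a richer basis is kept only when it buys enough residual compression; this is the "iteratively weight fit against complexity" loop in miniature.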

Distribution fittings

Flexible Gaussian Mixtures

Fit a GMM ( code ) with a flexible number of components.

  • One-dimensional, on synthetic data (can be applied to estimate marginal complexity)
  • Two-dimensional, on synthetic data (as an example of causality-agnostic distribution fitting)
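One standard way to make the number of components flexible is to fit GMMs of increasing size by EM and keep the one with the best BIC. A self-contained 1-D sketch (illustrative, not the repo's code):

```python
import numpy as np

def gmm_em_1d(x, k, n_iter=200, tol=1e-8):
    """Fit a k-component 1-D Gaussian mixture by EM; return log-likelihood."""
    n = x.shape[0]
    # initialise means at evenly spaced quantiles, shared variance, uniform weights
    mu = np.quantile(x, np.linspace(0.1, 0.9, k))
    var = np.full(k, x.var())
    pi = np.full(k, 1.0 / k)
    ll_old = -np.inf
    for _ in range(n_iter):
        # E-step: per-component densities and responsibilities
        dens = pi * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        total = dens.sum(axis=1, keepdims=True)
        ll = float(np.log(total).sum())
        resp = dens / total
        # M-step: update weights, means, variances (with a small variance floor)
        nk = resp.sum(axis=0)
        pi = nk / n
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-6
        if ll - ll_old < tol:
            break
        ll_old = ll
    return ll, mu

def select_num_components(x, k_max=3):
    """Choose k by minimising BIC = -2 ll + n_params log(n)."""
    n = x.shape[0]
    best_k, best_bic = 1, np.inf
    for k in range(1, k_max + 1):
        ll, _ = gmm_em_1d(x, k)
        n_params = 3 * k - 1  # k means, k variances, k-1 free weights
        bic = -2 * ll + n_params * np.log(n)
        if bic < best_bic:
            best_k, best_bic = k, bic
    return best_k
```

On the 1-D synthetic data above, the BIC penalty stops the fit from adding components once the marginal gain in likelihood is too small, which is what makes the component count usable as a crude marginal-complexity estimate.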

Experiments and Visualisations

With a few exceptions, every picture and experiment reported can be found in the tests/data subdirectory. For particularly large files or a large number of pictures, a separate picture-only repo is available!

The dependencies can be installed using pip install -r requirements.txt (or pip3 install -r requirements.txt).