ArnoVel/structure-identification
Probing the Structure of Bivariate Distributions

The goal of this repository is to compile and compare statistical methods and algorithms that take bivariate data (point clouds) as input and attempt to infer a causal direction.

Good references on this topic are:

  • A very comprehensive benchmark of methods based on Additive Noise Models, together with all the surrounding concepts
  • Several machine-learning algorithms using distribution embeddings have been designed, e.g. RCC and KCDC. A more statistical approach is QCDC (copulas + quantile scores)
  • The SLOPE algorithm is a framework that assumes a set of basis functions and iteratively trades off goodness of fit against function complexity to find the "best" model. Various instantiations exist, such as Slope-S, Slope-D, an identifiable variant, etc. More information can be found in their journal paper
  • RECI is a statistical approach based on regression, identifiable in the low-noise setting
  • IGCI justifies a statistical approach in the case where the relationship is deterministic and invertible. Additional material can be found in their subsequent paper.
  • A good review of graphical models for more than two variables can also help in understanding the general point of view.
  • CGNN connects graphical models, generative models, and bivariate methods in an interpretable fashion (using neural networks). It is a good bridge between bivariate and graph methods. The authors are currently building a very helpful Python causal discovery library

Dependence / Independence Measures

Many causal algorithms rely on independence tests and similarity tests. Some examples are:

  • Bivariate methods using Additive Noise Models often use Mutual Information or HSIC
  • Constraint-based methods for graph data use conditional independence tests. A good statistical test is the KCI test, together with the related KPC algorithm. When a faster, approximate method is needed, the authors (and others) have recently designed approximations such as RCIT and RCOT. Another good, but quadratic-complexity, conditional independence test is PCIT
  • A good review of dependence tests can be found in this interesting thesis
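As a concrete toy example of the constraint-based recipe (not one of the kernel tests above, and not code from this repo), a Fisher-z test of partial correlation illustrates the idea: residualise both variables on the conditioning set, then test the correlation of the residuals. All names below are illustrative:

```python
import numpy as np
from math import erfc, log, sqrt

def partial_corr_pvalue(x, y, z=None):
    """Fisher-z test of (partial) correlation; H0: corr(x, y | z) = 0.
    Two-sided p-value; z is an optional conditioning set (n, k)."""
    n = len(x)
    k = 0
    if z is not None:
        z = z.reshape(n, -1)
        k = z.shape[1]
        Z = np.column_stack([np.ones(n), z])
        # residualise x and y on z via least squares
        x = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
        y = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    r = np.corrcoef(x, y)[0, 1]
    fz = 0.5 * log((1 + r) / (1 - r))        # Fisher's z-transform
    stat = sqrt(n - k - 3) * abs(fz)         # approximately standard normal under H0
    return erfc(stat / sqrt(2))
```

On a linear-Gaussian chain x → z → y, this test finds x and y marginally dependent but (approximately) independent given z, which is exactly the pattern constraint-based methods exploit.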

Here we are interested in differentiable versions of various statistical tests. We implemented several tests in PyTorch, using smooth approximations to existing tests, allowing backpropagation with respect to all inputs and parameters.
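For reference, here is a minimal NumPy sketch of the biased HSIC statistic (Gaussian kernels, median heuristic) that the differentiable PyTorch versions build on; every operation below is smooth in the inputs, which is what makes a backprop-ready port possible. The function names are illustrative, not the repo's API:

```python
import numpy as np

def _gaussian_gram(x, bandwidth=None):
    """Gaussian-kernel Gram matrix; median heuristic if no bandwidth given."""
    d2 = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    if bandwidth is None:
        med = np.median(d2[d2 > 0]) if np.any(d2 > 0) else 1.0
        bandwidth = np.sqrt(med / 2)
    return np.exp(-d2 / (2 * bandwidth ** 2))

def hsic_biased(x, y):
    """Biased HSIC estimator: trace(K H L H) / n^2 with centring matrix H."""
    n = x.shape[0]
    K = _gaussian_gram(x.reshape(n, -1))
    L = _gaussian_gram(y.reshape(n, -1))
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(K @ H @ L @ H) / n ** 2
```

The statistic is near zero for independent samples and grows with (possibly nonlinear) dependence, which is why minimizing it over model parameters is a sensible ANM-fitting objective.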

HSIC

  • PyTorch HSIC test and an example of HSIC minimization ( code ) for ANM detection. Although the HSIC statistic is differentiable with respect to all its inputs, our implementation does not yet support hyperparameter fitting.

  • Examples of the 2D Gaussian HSIC-Gamma test and of ANM-detection tests will be uploaded.

  • We might re-implement the relative HSIC between two models
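The ANM-detection idea behind the HSIC-minimization example can be sketched as follows: regress in both directions and keep the direction whose residuals look more independent of the input. This toy version uses a cheap heteroscedasticity proxy, |corr(resid², input²)|, in place of HSIC, and polynomial regression in place of the repo's models; it is an illustration only:

```python
import numpy as np

def _residual_dependence(a, b, degree=5):
    """Fit b = f(a) + resid by polynomial least squares and return a crude
    dependence score between the residuals and the input (HSIC stand-in)."""
    coefs = np.polyfit(a, b, degree)
    resid = b - np.polyval(coefs, a)
    return abs(np.corrcoef(resid ** 2, a ** 2)[0, 1])

def anm_direction(x, y):
    """Prefer the direction with the more input-independent residuals."""
    s_xy = _residual_dependence(x, y)   # score for the model y = f(x) + n
    s_yx = _residual_dependence(y, x)   # score for the model x = g(y) + n
    return ('x->y' if s_xy < s_yx else 'y->x'), s_xy, s_yx
```

For an additive-noise pair such as y = x³ + n, the backward regression has residuals whose spread varies with y, so the backward score is larger and the forward direction wins.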

MMD
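The MMD is used as a similarity test between two samples. As a reference point, the biased estimate of the squared Maximum Mean Discrepancy with a Gaussian kernel takes a few lines of NumPy (fixed bandwidth for simplicity; names are illustrative, not the repo's API):

```python
import numpy as np

def _sqdists(a, b):
    """Pairwise squared Euclidean distances between rows of a and b."""
    return np.sum(a ** 2, 1)[:, None] + np.sum(b ** 2, 1)[None, :] - 2 * a @ b.T

def mmd2_biased(x, y, bandwidth=1.0):
    """Biased estimate of squared MMD with a Gaussian kernel:
    mean(Kxx) + mean(Kyy) - 2 mean(Kxy)."""
    k = lambda a, b: np.exp(-_sqdists(a, b) / (2 * bandwidth ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()
```

The statistic is near zero when the two samples come from the same distribution and grows as the distributions separate; like HSIC above, it is smooth in the inputs, so a differentiable PyTorch port is direct.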

C2ST

Classifier Two-Sample Tests (C2ST) were introduced and evaluated in this paper. Here we re-implement and slightly adapt the authors' Lua code, which includes

  • C2ST-NN: a shallow neural-network classifier (ReLU + sigmoid) with 20 hidden units by default. Although adding layers/hidden units is a good idea, we usually work with 500-5000 samples per distribution and/or aim for an accuracy above 55% to reject P = Q
  • C2ST-KNN: a k-nearest-neighbours classifier with k = floor(n_te / 2). Usually performs worse than the neural network.

The idea, in broad terms, is that under H0 (P = Q) the classifier can do no better than chance, and n_te * acc is distributed as Binomial(n_te, 0.5). The accuracy under H0 can therefore be approximated as Normal(0.5, 0.25/n_te); we use this approximate null to compute a p-value for the observed accuracy and reject H0 accordingly.
Some basic examples can be found in this subdirectory.
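The normal approximation of the null described above translates directly into a p-value computation; a minimal sketch (the function name is ours, not the repo's):

```python
import math

def c2st_pvalue(acc, n_te):
    """One-sided p-value for the observed test accuracy under H0: P = Q,
    using the approximation acc ~ Normal(0.5, 0.25 / n_te)."""
    z = (acc - 0.5) / math.sqrt(0.25 / n_te)
    # P(Normal(0,1) >= z), via the complementary error function
    return 0.5 * math.erfc(z / math.sqrt(2))
```

For example, with n_te = 500 test samples, an accuracy of 55% already gives a p-value of about 0.013, consistent with the ">55% to reject" rule of thumb above.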

Bivariate Causal Algorithms

SLOPE

We are currently re-implementing SLOPE in Python, supporting both NumPy and PyTorch datatypes. An example of the SLOPE fit with 13 basis functions can be found in this folder ( code ), which also contains mixed fits for 8 functions, and a little more.
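As a rough illustration of the fit-versus-complexity trade-off that SLOPE formalises with MDL (this is a BIC-flavoured toy over polynomial bases, not the paper's exact code lengths or this repo's implementation):

```python
import numpy as np

def mdl_style_score(x, y, degree):
    """Two-part score in bits: residual code length (Gaussian, up to
    constants) plus a parameter cost. A stand-in for SLOPE's MDL score."""
    n = len(x)
    coefs = np.polyfit(x, y, degree)
    resid = y - np.polyval(coefs, x)
    mse = np.mean(resid ** 2) + 1e-12
    data_bits = 0.5 * n * np.log2(mse)             # goodness of fit
    model_bits = (degree + 1) * 0.5 * np.log2(n)   # function complexity
    return data_bits + model_bits

def best_degree(x, y, max_degree=6):
    """Pick the basis (here: polynomial degree) with the lowest total score."""
    return min(range(1, max_degree + 1), key=lambda d: mdl_style_score(x, y, d))
```

The parameter cost penalises each extra basis function, so a richer basis is kept only when it buys enough residual compression; this is the "iteratively weight fit against complexity" loop in miniature.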

Distribution fittings

Flexible Gaussian Mixtures

Fit a GMM ( code ) with a flexible number of components.

  • One-dimensional, on synthetic data (can be applied to estimate marginal complexity)
  • Two-dimensional, on synthetic data (as an example of causality-agnostic distribution fitting)
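One standard way to make the number of components flexible is to fit GMMs of increasing size by EM and keep the one with the best BIC. A self-contained 1-D sketch (illustrative, not the repo's code):

```python
import numpy as np

def gmm_em_1d(x, k, n_iter=200, tol=1e-8):
    """Fit a k-component 1-D Gaussian mixture by EM; return log-likelihood."""
    n = x.shape[0]
    # initialise means at evenly spaced quantiles, shared variance, uniform weights
    mu = np.quantile(x, np.linspace(0.1, 0.9, k))
    var = np.full(k, x.var())
    pi = np.full(k, 1.0 / k)
    ll_old = -np.inf
    for _ in range(n_iter):
        # E-step: per-component densities and responsibilities
        dens = pi * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        total = dens.sum(axis=1, keepdims=True)
        ll = float(np.log(total).sum())
        resp = dens / total
        # M-step: update weights, means, variances (with a small variance floor)
        nk = resp.sum(axis=0)
        pi = nk / n
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-6
        if ll - ll_old < tol:
            break
        ll_old = ll
    return ll, mu

def select_num_components(x, k_max=3):
    """Choose k by minimising BIC = -2 ll + n_params log(n)."""
    n = x.shape[0]
    best_k, best_bic = 1, np.inf
    for k in range(1, k_max + 1):
        ll, _ = gmm_em_1d(x, k)
        n_params = 3 * k - 1  # k means, k variances, k-1 free weights
        bic = -2 * ll + n_params * np.log(n)
        if bic < best_bic:
            best_k, best_bic = k, bic
    return best_k
```

On the 1-D synthetic data above, the BIC penalty stops the fit from adding components once the marginal gain in likelihood is too small, which is what makes the component count usable as a crude marginal-complexity estimate.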

Experiments and Visualisations

With a few exceptions, every picture and experiment reported can be found in the tests/data subdirectory. For particularly large files or a large number of pictures, a separate picture-only repo is available!

The dependencies can be installed using pip install -r requirements.txt (or pip3 install -r requirements.txt).