
mutual information between different high-dimensional continuous signals #10

Open
sankar-mukherjee opened this issue Feb 14, 2019 · 7 comments


@sankar-mukherjee

sankar-mukherjee commented Feb 14, 2019

Hello Prof. Greg Ver Steeg,

I want to compute MI between two high-dimensional, continuous, time-varying signals. Their dimensions are 39 and 300. It seems like this toolbox is not suitable for that. Do you know of an easy way to measure MI in this situation?

@gregversteeg
Owner

Those dimensionalities are a little high for the nearest-neighbor-based estimators. It might work if the signal is intrinsically low-dimensional.
If not, there have been some recent ideas for high-d mutual information estimation with neural networks, like MINE. Unfortunately, I don't think they released their code, and a student in my group found scenarios where that type of analysis breaks.
Are you more interested in upper or lower bounds for MI? If you only want a lower bound, you could always do some dimensionality reduction first. (The lower bound then follows from the data processing inequality.)
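
A minimal sketch of that route, assuming scikit-learn for the PCA step and NPEET's entropy_estimators.mi for the estimate (the `from npeet import ...` path depends on how the package was installed, and the random arrays below are just placeholders for the real signals):

```python
import numpy as np
from sklearn.decomposition import PCA
from npeet import entropy_estimators as ee

# Placeholder data standing in for the real signals:
# x is the 300-dimensional signal, y the 39-dimensional one, sampled at the same times.
rng = np.random.default_rng(0)
n_samples = 2000
x = rng.normal(size=(n_samples, 300))
y = rng.normal(size=(n_samples, 39))

# Reduce the 300-dim signal to a small number of components first.
x_reduced = PCA(n_components=10).fit_transform(x)

# By the data processing inequality, I(f(X); Y) <= I(X; Y) for any f,
# so this kNN estimate can be read as an estimate of a lower bound on the true MI.
mi_lower_bound = ee.mi(x_reduced, y, k=3)
print(mi_lower_bound)
```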

@sankar-mukherjee
Author

Sorry, but I don't know much about what upper or lower bounds for MI are. Following your dimensionality-reduction suggestion: if I apply PCA to the 300-dim signal, the first components do not capture most of the variance.

[Attached plot: variance captured by the PCA components.]

Do you think it's OK to use MI after this?

Also, I have applied MINE (code I found on GitHub) to my data, and yes, for higher dimensions (for example over 10) it's not stable, although I am not sure the code is correct.

@gregversteeg
Owner

gregversteeg commented Feb 28, 2019

I've heard from others that MINE is actually not that stable, so it may not just be you.

You can always do dimensionality reduction to get a LOWER bound on Mutual information. So if you take the first K components and then apply NPEET, that can be interpreted as an estimator for a lower bound on mutual information. It's hard to tell how good the lower bound will be... but you could try using K=10 or 20 components and see how stable your estimate looks. If K is too small, you will probably lose a lot of information. If K is too large, the NPEET estimates will be unstable.

What would be cool is to make a plot of the mutual information estimate using different K, with error bars constructed using the shuffle/permutation test. You should see the MI estimate go up with K, but then at some point the error bars will become large. Hopefully you can find a good middle ground (i.e., a K which has large mutual information estimates, but small error bars).
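
A rough sketch of that sweep, under the same assumptions as the earlier snippet (scikit-learn PCA, NPEET, placeholder arrays for the real signals); shuffle_test is used for the permutation part, and its return format (a mean plus a confidence interval over the shuffled runs) is worth double-checking against the installed version:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from npeet import entropy_estimators as ee

# Placeholders for the real signals.
rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 300))
y = rng.normal(size=(1000, 39))

ks = [2, 5, 10, 20, 40, 60, 80]
mi_estimates, null_lo, null_hi = [], [], []
for k in ks:
    x_k = PCA(n_components=k).fit_transform(x)
    mi_estimates.append(ee.mi(x_k, y))
    # Shuffling x_k destroys its dependence with y, so this interval shows
    # the range of values the estimator reports when there is no real dependence.
    _, (lo, hi) = ee.shuffle_test(ee.mi, x_k, y, ci=0.95, ns=200)
    null_lo.append(lo)
    null_hi.append(hi)

plt.plot(ks, mi_estimates, marker="o", label="MI estimate")
plt.fill_between(ks, null_lo, null_hi, alpha=0.3, label="95% interval under shuffling")
plt.xlabel("number of PCA components K")
plt.ylabel("estimated MI")
plt.legend()
plt.show()
```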

@gregversteeg
Owner

Some of the issues with MINE are discussed in this paper:
http://bayesiandeeplearning.org/2018/papers/136.pdf
They also suggest a way to get more stable estimates, but I didn't see code posted.

@sankar-mukherjee
Author

I have tried increasing the number of PCA components applied to my 300-dim signal, then computed MI between the reduced signal and my other 21-dim signal (y). I have used this
ee.shuffle_test(ee.mi, y, pca.components_.T, z=False, ci=0.95, ns=1000)
function to compute the confidence interval. Here I don't see the error bars increasing; rather, the MI peaks at a certain number of components and then goes down.

[Attached plot: estimated MI vs. number of PCA components, peaking around K=52.]

From this plot, should I choose K=52 as my optimum value?

@gregversteeg
Owner

Interesting plot, thanks! I think it does make sense to pick K=52 as optimal.

The way the decrease looked on the right side surprised me. My prediction was that the error bars would get large; I didn't predict the decrease. I looked around for some literature on this but didn't find anything. Here is a little discussion about what (might!) be going on.
(1) The increase on the left side is easy to understand. You can imagine that we have a Markov chain, f(X) - X - Y, where f(X) is any function of X, including dimensionality reduction from PCA. Then I(f(X);Y) <= I(X;Y) by the data processing inequality. Additionally, you could imagine taking a longer chain with all PCA components, or just the top k, like this: PCA_1(X) - ... - PCA_1:k(X) - ... - PCA_1:n(X) - X - Y. Then we can see from the data processing inequality that the MI should be non-decreasing as we increase k.
(2) However, contrary to the statement in (1), the MI does decrease (after K=52 in your picture)! What is happening? Well, we know that the true MI is not actually decreasing, due to the data processing inequality (spelled out below). Therefore what must be happening is that the estimator is under-estimating MI.
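
To spell that out, writing PCA_{1:k}(X) for the projection onto the top k principal components, the data processing inequality applied along the chain in (1) gives

$$
I(\mathrm{PCA}_{1}(X);Y) \le I(\mathrm{PCA}_{1:2}(X);Y) \le \cdots \le I(\mathrm{PCA}_{1:n}(X);Y) \le I(X;Y),
$$

so the true MI can only stay flat or grow as K increases; any observed decrease has to come from the estimator.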
What I was looking for in the literature is a discussion of whether this under-estimation gets systematically worse in high dimensions. I didn't find it, but we had some bounds on how estimators under-estimate mutual information here:
http://proceedings.mlr.press/v38/gao15.html
But the dependence on dimension in our paper does not show a decrease, which suggests that our bound can be tightened. There have been some follow-up works; maybe they have tighter bounds (e.g. https://openreview.net/forum?id=BkedwoC5t7, but I didn't see a discussion of dimension on my first pass).
Anyway, it's clear that the drop on the right is due to under-estimation of MI, so it makes sense to pick the highest value before this effect becomes larger than the increase we expect from the data processing inequality. A formal study of this under-estimation in high dimensions might be a good research topic.

@gregversteeg
Owner

One other thought:

It's probably hard to come up with results saying that MI is systematically under-estimated in high dimensions, because it is always possible that the data still lie approximately on a low-dimensional manifold (in which case we'd still expect kNN estimators to work).

However, when you do PCA, all the components are normalized so that as you add more top components the effective dimensionality stays large, even if that isn't true for the original data.
You can look at the PCA eigenvalues to see how large the variation for the k-th component is.
One idea would be to multiply the k-th PCA component by its corresponding eigenvalue so that its scale reflects its variation in the data. I predict you will get a more monotonic-looking curve if you do this. (And multiplying by a constant doesn't affect the true MI. This is just to make things easier for the estimator.)
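
A small sketch of that rescaling, assuming scikit-learn's PCA (with whiten=True to make the per-component normalization explicit) and NPEET for the estimate; whether to multiply by the eigenvalue, as suggested here, or by its square root (the standard deviation) is a judgment call, and either way it is a per-coordinate constant rescaling, so the true MI is unchanged:

```python
import numpy as np
from sklearn.decomposition import PCA
from npeet import entropy_estimators as ee

# Placeholders for the real signals.
rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 300))
y = rng.normal(size=(1000, 39))

k = 52
pca = PCA(n_components=k, whiten=True).fit(x)

# Whitened scores: each of the k components has unit variance,
# so the effective dimensionality the estimator sees is the full k.
scores_normalized = pca.transform(x)

# Rescale component j by its eigenvalue so its scale reflects its variance in the data.
# Multiplying a coordinate by a nonzero constant doesn't change the true MI;
# it only changes what the kNN estimator sees.
scores_rescaled = scores_normalized * pca.explained_variance_

print(ee.mi(scores_rescaled, y))
```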
